Re: ScrollId doesn't advance with 2 indexes on a read alias 1.4.4

2015-04-14 Thread Todd Nine
Correct, I'm always getting the first entity back with the scroll id.  This
doesn't happen if I remove index A from the alias.  I then get the results
I expect.

On Tue, Apr 14, 2015 at 12:05 PM, Roger de Cordova Farias <
roger.far...@fontec.inf.br> wrote:

> Are you sure that calling the same scroll_id won't return the next results?
>
> AFAIK,  the scroll_id can be the same and still return new records
>
> 2015-04-14 14:26 GMT-03:00 Todd Nine :
>
>> Hey guys,
>>   I have 2 indexes.  I have a read alias on both of the indexes (A and
>> B), and a write alias on 1 (B).   I then insert 10 documents to the write
>> alias which inserts them into index B.  I perform the following query.
>>
>> {
>>   "from" : 0,
>>   "size" : 1,
>>   "post_filter" : {
>> "bool" : {
>>   "must" : {
>> "term" : {
>>   "edgeSearch" :
>> "4cd2ba95-e2c9-11e4-bb39-c6c6eebe8d56_application__4cd2ba96-e2c9-11e4-bb39-c6c6eebe8d56_owner__users__SOURCE"
>> }
>>   }
>> }
>>   },
>>   "sort" : [ {
>> "fields.double" : {
>>   "order" : "asc",
>>   "nested_filter" : {
>> "term" : {
>>   "name" : "ordinal"
>> }
>>   }
>> }
>>   }, {
>> "fields.long" : {
>>   "order" : "asc",
>>   "nested_filter" : {
>> "term" : {
>>   "name" : "ordinal"
>> }
>>   }
>> }
>>   }, {
>> "fields.string.exact" : {
>>   "order" : "asc",
>>   "nested_filter" : {
>> "term" : {
>>   "name" : "ordinal"
>> }
>>   }
>> }
>>   }, {
>> "fields.boolean" : {
>>   "order" : "asc",
>>   "nested_filter" : {
>> "term" : {
>>   "name" : "ordinal"
>> }
>>   }
>> }
>>   } ]
>> }
>>
>> I receive my first record, and a scroll id, as expected.
>>
>> On my next request, I perform a request with the scroll Id from the
>> first response.
>>
>> What I expect:  I expect to receive my second record, and a new scrollId.
>>
>> What I get:  I get the first record again, with the same scroll Id.
>>
>> I'm on a 1.4.4 server, with a 1.4.4 node client running local
>> integration tests.
>>
>>
>> When I use the same logic on a read alias with a single index, I do not
>> experience this problem, so I'm reasonably certain my client is coded
>> correctly.
>>
>>
>> Any ideas?
>>
>> Thanks,
>> Todd
>>



ScrollId doesn't advance with 2 indexes on a read alias 1.4.4

2015-04-14 Thread Todd Nine
Hey guys,
  I have 2 indexes.  I have a read alias on both of the indexes (A and B), 
and a write alias on 1 (B).   I then insert 10 documents to the write alias 
which inserts them into index B.  I perform the following query.

{
  "from" : 0,
  "size" : 1,
  "post_filter" : {
"bool" : {
  "must" : {
"term" : {
  "edgeSearch" : 
"4cd2ba95-e2c9-11e4-bb39-c6c6eebe8d56_application__4cd2ba96-e2c9-11e4-bb39-c6c6eebe8d56_owner__users__SOURCE"
}
  }
}
  },
  "sort" : [ {
"fields.double" : {
  "order" : "asc",
  "nested_filter" : {
"term" : {
  "name" : "ordinal"
}
  }
}
  }, {
"fields.long" : {
  "order" : "asc",
  "nested_filter" : {
"term" : {
  "name" : "ordinal"
}
  }
}
  }, {
"fields.string.exact" : {
  "order" : "asc",
  "nested_filter" : {
"term" : {
  "name" : "ordinal"
}
  }
}
  }, {
"fields.boolean" : {
  "order" : "asc",
  "nested_filter" : {
"term" : {
  "name" : "ordinal"
}
  }
}
  } ]
} 

I receive my first record, and a scroll id, as expected.

On my next request, I perform a request with the scroll Id from the
first response.

What I expect:  I expect to receive my second record, and a new scrollId.

What I get:  I get the first record again, with the same scroll Id.

I'm on a 1.4.4 server, with a 1.4.4 node client running local integration
tests.


When I use the same logic on a read alias with a single index, I do not 
experience this problem, so I'm reasonably certain my client is coded 
correctly.
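
For reference, the follow-up request goes through the scroll API rather than
re-running the original search.  A minimal sketch against the 1.x Java client
(the alias name and query are placeholders, not our real client code):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;

public class ScrollSketch {
    // First page via a normal search with a scroll timeout, then the next page
    // via the scroll API using the id from the previous response.  In 1.x the
    // id string itself may stay the same between pages; new hits should still
    // come back on each call.
    static SearchResponse secondPage(Client client, String readAlias) {
        SearchResponse first = client.prepareSearch(readAlias)
                .setQuery(QueryBuilders.matchAllQuery())   // real term query elided
                .setScroll(new TimeValue(60000))
                .setSize(1)
                .execute().actionGet();

        return client.prepareSearchScroll(first.getScrollId())
                .setScroll(new TimeValue(60000))
                .execute().actionGet();
    }
}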


Any ideas?

Thanks,
Todd



Creating a no-op QueryBuilder and FilterBuilder

2015-04-06 Thread Todd Nine
Hey all,
  I have a bit of an odd question; hopefully someone can give me an answer.
In Usergrid, we have our own existing query language.  We parse this
language into an AST, then visit each of the nodes and construct an ES
query with the Java client.  So far, very straightforward.  However, we
have 1 wrench in the works.  Some of our operations can only be implemented
as filters.  As a result, visiting some of our nodes results in a
QueryBuilder operation, others in a FilterBuilder operation.  To keep our
context so that we still evaluate AND, OR, and NOT correctly, I'm pushing
BoolQueryBuilder and BoolFilterBuilder onto our stacks as we traverse the
tree.  Within these bool terms, I will have both query and filter
operations that represent a "no-op" operation.  Is there such a builder that
already exists in the client, or will I need to create my own that doesn't
append any content during the build phase?
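
One possible stand-in is match_all, which matches every document and so does not
narrow a bool clause; a minimal sketch with the 1.x builders (the class and method
names below are illustration only, not an official "no-op" builder):

import org.elasticsearch.index.query.BoolFilterBuilder;
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.QueryBuilders;

public class NoOpSketch {
    // A bool query whose only clause is match_all adds no restriction, so it
    // behaves as a "no-op" placeholder during tree traversal.
    static BoolQueryBuilder noOpQuery() {
        return QueryBuilders.boolQuery().must(QueryBuilders.matchAllQuery());
    }

    // The filter-side equivalent.
    static BoolFilterBuilder noOpFilter() {
        return FilterBuilders.boolFilter().must(FilterBuilders.matchAllFilter());
    }
}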

Thanks,
Todd



Re: Set index.query.parse.allow_unmapped_fields = false for a single index

2015-04-02 Thread Todd Nine
Figured this out (derp).  So my next question is: how can I disable dynamic
mappings so that they don't occur?
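
One way to do that appears to be a _default_ mapping with dynamic turned off; a
rough sketch with the Java admin client (the index name is a placeholder; "strict"
rejects unmapped fields, while false silently ignores them):

import org.elasticsearch.client.Client;

public class DisableDynamicSketch {
    // Stop types on this index from dynamically adding unmapped fields.
    // Note: a _default_ mapping applies to types created after it is put.
    static void putStrictDefaultMapping(Client client, String index) {
        String defaultMapping = "{ \"_default_\" : { \"dynamic\" : \"strict\" } }";
        client.admin().indices()
                .preparePutMapping(index)
                .setType("_default_")
                .setSource(defaultMapping)
                .execute().actionGet();
    }
}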

Thanks,
Todd

On Thursday, April 2, 2015 at 3:55:07 PM UTC-6, Todd Nine wrote:
>
> Hey guys,
>   We're running ES as a core service, so I can't guarantee that my 
> application will be the only one using the cluster.  Is it possible to 
> disallow unmapped fields explicitly for an index without modifying the 
> global setting?
>
> Thanks,
> Todd
>



Set index.query.parse.allow_unmapped_fields = false for a single index

2015-04-02 Thread Todd Nine
Hey guys,
  We're running ES as a core service, so I can't guarantee that my 
application will be the only one using the cluster.  Is it possible to 
disallow unmapped fields explicitly for an index without modifying the 
global setting?
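
What I'd like to be able to do, sketched against the Java admin client (the index
name and shard counts are placeholders, and this assumes the index.* prefix means
the setting can be applied per index at creation time):

import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

public class UnmappedFieldsSketch {
    // Create an index that rejects queries/sorts on fields with no mapping,
    // without touching the cluster-wide default.
    static void createStrictIndex(Client client, String index) {
        client.admin().indices()
                .prepareCreate(index)
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("index.query.parse.allow_unmapped_fields", false)
                        .put("index.number_of_shards", 6)        // placeholder values
                        .put("index.number_of_replicas", 1)
                        .build())
                .execute().actionGet();
    }
}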

Thanks,
Todd



Using scroll and different results sizes

2015-03-24 Thread Todd Nine
Hey all,
  We use ES as our indexing subsystem.  Our canonical record store is
Cassandra.  Due to the denormalization we perform for faster query speed,
occasionally the state of documents in ES can lag behind the state of our
Cassandra instance.  To accommodate this eventually consistent system, our
query path is the following.


Query ES ->  returns scrollId and document ids-> Load entities from 
Cassandra -> Drop "stale" documents from results and asynchronously remove 
them.


If the user requested 10 entities, and only 8 were current, we will be 
missing 2 results and need to make another trip to ES to satisfy the result 
size.  Is it possible to do the following?


//get first page with 10 results
POST /testindex/mytype/_search?scroll=1m {"from":0, "size": 10} 

POST /testing/mytype/_search?scroll_id=  {"size":2}


I always seem to get back 10 on my second request.  If no other option is
available, I'd rather truncate our result set size to < the user's
requested size than take the performance hit of using from and size to drop
results.  Our document counts can be 50 million+, so I'm concerned with the
performance implications of not using scroll.
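
For context, the workaround described above looks roughly like this in code: keep
pulling the fixed-size scroll pages and drop stale hits client-side until enough
live documents are collected (a sketch only; the Cassandra staleness check is just
a placeholder):

import java.util.ArrayList;
import java.util.List;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.search.SearchHit;

public class ScrollFillSketch {
    // Pull fixed-size scroll pages until "wanted" live documents are collected.
    static List<SearchHit> collect(Client client, SearchResponse firstPage, int wanted) {
        List<SearchHit> live = new ArrayList<SearchHit>();
        SearchResponse page = firstPage;
        while (live.size() < wanted && page.getHits().getHits().length > 0) {
            for (SearchHit hit : page.getHits().getHits()) {
                if (live.size() == wanted) {
                    break;
                }
                if (isCurrentInCassandra(hit)) {   // placeholder for the staleness check
                    live.add(hit);
                }
            }
            page = client.prepareSearchScroll(page.getScrollId())
                    .setScroll(new TimeValue(60000))
                    .execute().actionGet();
        }
        return live;
    }

    static boolean isCurrentInCassandra(SearchHit hit) {
        return true;   // stand-in; the real check loads the entity from Cassandra
    }
}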


Thoughts?


Thanks,
Todd






Very slow index speeds with dynamic mapping and large volume of documents with new fields

2015-03-11 Thread Todd Nine
Hey all,
  We're bumping up against a production problem I could use a hand with. 
 We're experiencing steadily decreasing index speeds.  We have 12 c3.4xl 
data nodes, and 1 c3.8xl master node (with 2 backups that are smaller). 
 We're indexing 45 million documents into a single index.  Single shard 
only, no replicas.   As our number of documents grow, our indexing speed 
slows to a crawl.  We've applied all the standard mlockall, ulimit, and ssd 
merge throttling tuning settings, so I feel our cluster is pretty good.  

When I inspected the data, I noticed our user is adding a new field on
every document.  When I view the pending tasks on our master, the task
queue always has at least 300+ entries attempting to perform dynamic mapping
updates.  I've also checked segment merging: we never have more than 1 merge
going on, and even then it lasts for a second or two, not long at all.
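
For anyone reproducing this, the pending queue can be inspected from the Java
client roughly as below (a sketch, assuming a connected Client; /_cat/pending_tasks
over HTTP shows the same information):

import org.elasticsearch.action.admin.cluster.tasks.PendingClusterTasksResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.cluster.service.PendingClusterTask;

public class PendingTasksSketch {
    // Print what the master is queueing; with one new field per document this
    // is dominated by mapping-update tasks.
    static void dumpPendingTasks(Client client) {
        PendingClusterTasksResponse response = client.admin().cluster()
                .preparePendingClusterTasks()
                .execute().actionGet();
        for (PendingClusterTask task : response.getPendingTasks()) {
            System.out.println(task.getTimeInQueueInMillis() + "ms  " + task.getSource());
        }
    }
}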


This brings me to my question.  When dynamic mapping is performed, is this 
on the master only?  Obviously this would introduce a bottleneck, and 
explain our sudden performance drop.  I'm at a loss to explain this issue. 
 Any advice would be appreciated.

Thanks,
Todd



Re: Unexpected shard allocation with 0 replica indexes

2015-02-23 Thread Todd Nine
Thanks Mark,
  I've looked at these, as well as these configuration parameters.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-update-settings.html#_balanced_shards

However, I've not been able to find a configuration that actually
accomplishes what we're looking for.  We're using the aws plugin as well,
which doesn't seem to help any with our shard allocation.  It does, however,
ensure we have replicas in other zones, which is obviously helpful for
redundancy.

Are there any known settings, either with allocation awareness or cluster
settings, that will accomplish the behavior I'm looking for?  Otherwise,
we'll have to write a client tool to move the shards after index allocation
to ensure we get an even load.
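
One index-level knob worth noting here is index.routing.allocation.total_shards_per_node,
sketched below: it caps how many shards of a given index any one node may hold, so with
6 shards, 0 replicas, and 6 nodes it forces the spread described.  It is a hard cap
rather than a balancing weight, so it may not be a complete answer:

import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

public class EvenSpreadSketch {
    // One shard of this index per node: 6 shards, 0 replicas, and at most a
    // single shard of the index on any one node.
    static void createEvenlySpreadIndex(Client client, String index) {
        client.admin().indices()
                .prepareCreate(index)
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("index.number_of_shards", 6)
                        .put("index.number_of_replicas", 0)
                        .put("index.routing.allocation.total_shards_per_node", 1)
                        .build())
                .execute().actionGet();
    }
}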

Thanks,
Todd

On Sunday, February 22, 2015 at 3:25:28 PM UTC-7, Mark Walkom wrote:
>
> It's not a bug, ES allocates based on higher level primary shard count and 
> doesn't take into account what index a shard may belong to, this is where 
> allocation awareness and routing comes into play.
>
> Take a look at 
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-cluster.html#allocation-awareness
>
> On 23 February 2015 at 06:46, Todd Nine wrote:
>
>> Hi All,
>>   We have several indexes in our ES cluster.  ES is not our canonical 
>> system record, we use it only for searching. 
>>
>> Some of our applications have very high write throughput, so for these we 
>> allocate a singular primary shard for each of our nodes.  For example, we 
>> have 6 nodes, and we create our index with 6 shards and 0 replicas.  I 
>> would expect ES to balance  primary shards per node in the index.  However, 
>> it will load as many as 3 primaries on a single node, and some nodes will 
>> have no shards.  While each node does appear to have the same number of 
>> primaries, not all our indexes have equal write throughput, therefore, I 
>> want the primary (and only) shards evenly distributed per index.
>>
>> During heavy writes, we experience slowness because our shards are not 
>> evenly distributed across our nodes.   Currently we're getting around this 
>> by manually moving shards, but this seems to be something ES should do 
>> automatically.  We're running version 1.4.3 with default allocation 
>> parameter.  Is this a bug? 
>>
>> Thanks,
>> Todd
>>
>
>



Unexpected shard allocation with 0 replica indexes

2015-02-22 Thread Todd Nine
Hi All,
  We have several indexes in our ES cluster.  ES is not our canonical
system of record; we use it only for searching.

Some of our applications have very high write throughput, so for these we 
allocate a singular primary shard for each of our nodes.  For example, we 
have 6 nodes, and we create our index with 6 shards and 0 replicas.  I 
would expect ES to balance the index's primary shards evenly across the nodes.  However,
it will load as many as 3 primaries on a single node, and some nodes will 
have no shards.  While each node does appear to have the same number of 
primaries, not all our indexes have equal write throughput, therefore, I 
want the primary (and only) shards evenly distributed per index.

During heavy writes, we experience slowness because our shards are not 
evenly distributed across our nodes.   Currently we're getting around this 
by manually moving shards, but this seems to be something ES should do 
automatically.  We're running version 1.4.3 with default allocation 
parameter.  Is this a bug? 

Thanks,
Todd



Re: One shard continually fails to allocate

2015-02-18 Thread Todd Nine
Hey Aaron,
  What do you get back if you try to use these sets of commands to manually 
allocate the shard to a node?

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-reroute.html

I had this problem before, but it turned out we had 1 node that had
accidentally been upgraded, and the rest were still on a previous version.
I was able to determine this by reading the error output from the shard
allocation command.
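
Concretely, the kind of command I mean, sketched with the Java client (index, shard
number, and node are placeholders; the useful part is usually the explanation in the
failure it returns):

import org.elasticsearch.client.Client;
import org.elasticsearch.cluster.routing.allocation.command.AllocateAllocationCommand;
import org.elasticsearch.index.shard.ShardId;

public class ManualAllocateSketch {
    // Ask the master to place an unassigned shard copy on a specific node and
    // surface whatever reason it gives for refusing.
    static void allocate(Client client, String index, int shard, String nodeName) {
        client.admin().cluster()
                .prepareReroute()
                // allowPrimary=false: don't force an empty primary into existence
                .add(new AllocateAllocationCommand(new ShardId(index, shard), nodeName, false))
                .execute().actionGet();
    }
}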


Todd

On Tuesday, February 17, 2015 at 11:48:48 PM UTC-8, David Pilato wrote:
>
> What gives 
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cat-shards.html#cat-shards
> ?
>
> --
> David ;-)
> Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
>
> On 18 Feb 2015, at 06:44, Aaron C. de Bruyn wrote:
>
> All the servers have nearly 1 TB free space.
>
> -A
>
> On Tue, Feb 17, 2015 at 7:44 PM, David Pilato wrote:
>
> It's a replica?
>
> Might be because you are running low on disk space?
>
>
> --
>
> David ;-)
>
> Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
>
>
> On 18 Feb 2015, at 01:16, Aaron de Bruyn wrote:
>
>
> I have one shard that continually fails to allocate.
>
>
> There is nothing in the logs that would seem to indicate a problem on any 
> of
>
> the servers.
>
>
> The pattern of one of the copies of shard '2' not being allocated runs
>
> throughout all my logstash indexes.
>
>
> Running 1.4.3 on all nodes.
>
>
> Any pointers on what I should check?
>
>
> Thanks,
>
>
> -A
>
>


Uneven primary shard distribution with increasing index allocation

2015-02-17 Thread Todd Nine
Hey All,
  We're on ES 1.4.3.  Initially, I thought I had this issue, however it 
appears I was incorrect.

https://github.com/elasticsearch/elasticsearch/issues/9023

We're seeing very uneven primary shard distribution as we're adding new 
indexes to our system.  We're running a 6 node cluster.  We're adding 
indexes with 12 shards and 1 replica.  What I'm noticing is that 1/4 (3)
of the primary shards are being allocated to a single node.  In the case of
our application, this is causing severe performance problems since we're 
writing 1/3 of our throughput to a single node, leaving the remaining 2/3 
spread over 5 nodes.  We currently have 87 indexes and growing.  I never 
noticed this with a handful of indexes, but as we're getting more it's 
becoming more prevalent.

What can I do to diagnose and fix this issue?  From the screenshot above, 
I've tried moving the primary shards, but this seems to have little effect.

When catting the shards and checking how many primaries exist per node, 
they're very skewed

curl -X GET res001sy:9200/_cat/shards|grep "p STARTED"| grep res001sy|wc -l 
=> 259
curl -X GET res001sy:9200/_cat/shards|grep "p STARTED"| grep res002sy|wc -l 
=> 228
curl -X GET res001sy:9200/_cat/shards|grep "p STARTED"| grep res003sy|wc -l 
=> 191
curl -X GET res001sy:9200/_cat/shards|grep "p STARTED"| grep res004sy|wc -l 
=> 106
curl -X GET res001sy:9200/_cat/shards|grep "p STARTED"| grep res005sy|wc -l 
=> 37
curl -X GET res001sy:9200/_cat/shards|grep "p STARTED"| grep res006sy|wc -l 
=> 13

As you can see, 01 is our highest with 259 primary shards, however 06 only 
had 13.

I've tried forcing shards to be rebalance with the command below but it 
seems to have no effect.  

Also, note that we're running the aws cloud plugin. 
 https://github.com/elasticsearch/elasticsearch-cloud-aws.

curl -XPOST 'res002sy:9200/_cluster/reroute' -d '{}'
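
For the record, an explicit move through the Java client looks roughly like this
(index, shard number, and node names are placeholders, not our actual tooling):

import org.elasticsearch.client.Client;
import org.elasticsearch.cluster.routing.allocation.command.MoveAllocationCommand;
import org.elasticsearch.index.shard.ShardId;

public class MoveShardSketch {
    // Explicitly move one shard of an index from one node to another.
    static void moveShard(Client client, String index, int shard,
                          String fromNode, String toNode) {
        client.admin().cluster()
                .prepareReroute()
                .add(new MoveAllocationCommand(new ShardId(index, shard), fromNode, toNode))
                .execute().actionGet();
    }
}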


Any ideas what I can do to resolve this problem?  The uneven distribution
seems to be killing our more heavily loaded nodes, since they're taking a
majority of the primary shard throughput.  I'd prefer that most nodes have an
approximately equal number of primaries and replicas, since our indexes
are a mix of more write-heavy and more read-heavy.  We seem to have very few
that are 50/50.

Thanks,
Todd






Topology of tribe nodes in a distributed replicated environment and automatic index + alias creation.

2015-02-11 Thread Todd Nine
Hey guys,
  We have a slightly different use case than I'm able to find examples for 
with the tribe node.  Any feedback would be appreciated.

What we have now:

Single region:

We create indexes in our code automatically.  They're based on timeuuids, 
so we never have to worry about them conflicting.  We also create read and 
write aliases to these indexes.
We read and write documents to our local region only.
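
A simplified sketch of that index-plus-aliases step, with placeholder names rather
than our real code:

import org.elasticsearch.client.Client;

public class IndexAliasSketch {
    // Create a time-uuid-named index and attach the read and write aliases to it.
    static void createIndexWithAliases(Client client, String indexName,
                                       String readAlias, String writeAlias) {
        client.admin().indices().prepareCreate(indexName).execute().actionGet();
        client.admin().indices().prepareAliases()
                .addAlias(indexName, readAlias)
                .addAlias(indexName, writeAlias)
                .execute().actionGet();
    }
}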


What we want:

2+ Region Topology

us-west, us-east, asia pac

What I'm envisioning.

Our Code:  We implement a tool that will create a log of all index 
creations, as well as read and write alias creation.  These operations are 
then replayed individually on each region's cluster.  If a region is 
offline, or unreachable, it won't block other regions getting created.


Tribe Nodes:  Write to all regions.  Is there a way we can do this so that 
this returns to the caller after our local region has accepted the write? 
 We don't want to wait for all regions to ack the write.  Again if a region 
is offline, we'll lose that ability.  We can re-index those documents later.

Read:  Always read from local cluster for efficiency and responsiveness.

We're exploring all options for multi region deployments.  Ultimately we'd 
like reads to occur locally for performance, and are ok with queueing 
background writes to each region.


Thoughts?

Thanks,
Todd








Re: Help creating a near real time streaming plugin to perform replication between clusters

2015-01-23 Thread Todd Nine
Thanks for the suggestion on the tribe nodes.  I'll take a look
at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService
more in depth.  A reference implementation would be helpful in
understanding its usage; do you happen to know of any projects that use
it?

From an architecture perspective, I'm concerned with having the cluster
master initiate any replication operations aside from replaying index
modifications.  As we continue to increase our cluster size, I'm worried it
may become too much load on the master to keep up.  Our system is getting
larger every day; we have 12 c3.4xl instances in each region currently.
Our client to ES is a multi-tenant system
(http://usergrid.incubator.apache.org/), so each application created in the
system will get its own indexes in ES.  This allows us to scale the indexes
using read/write aliases per each application's usage.


To take a step back even further, is there a way we can use something
existing in ES to perform this work, possibly with routing rules etc.?  My
primary concerns are that we don't have to query across regions, that we can
recover from a region or network outage, and that replication can resume
once communication between regions is restored.  I seem to be venturing into
uncharted territory here by thinking I need to create a plugin, and I doubt
I'm the first user to encounter such a problem.  If there are any other
known solutions, that would be great.  I just need our replication time to
be every N seconds.
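
For completeness, the snapshot/restore route suggested earlier would be driven
roughly like this from the Java client if run on a timer (repository name, type,
and settings are placeholders; whether its latency can approach every N seconds
is exactly the open question):

import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

public class SnapshotShipSketch {
    // Register a repository once, then snapshot into it on a schedule.  Snapshots
    // in the same repository are incremental at the segment-file level.
    static void registerRepository(Client client) {
        client.admin().cluster()
                .preparePutRepository("replication_repo")        // placeholder name
                .setType("fs")                                   // or the s3 type from the cloud-aws plugin
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("location", "/mnt/es_snapshots")    // placeholder path
                        .build())
                .execute().actionGet();
    }

    static void snapshotNow(Client client, String snapshotName) {
        client.admin().cluster()
                .prepareCreateSnapshot("replication_repo", snapshotName)
                .setWaitForCompletion(false)
                .execute().actionGet();
    }
}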

Thanks again!
Todd

On Friday, January 23, 2015 at 2:43:08 PM UTC-7, Jörg Prante wrote:
>
> This looks promising.
>
> For admin operations, see also the tribe node. A special 
> "replication-aware tribe node" (or maybe more than one tribe node for 
> resiliency) could supervise the cluster-to-cluster replication.
>
> For the segment strategy, I think it is hard to go down to the level of 
> the index store and capture the files properly and put it over the wire to 
> a target. It should be better to replicate on shard level. Maybe by reusing 
> some of the code 
> of org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService so 
> that a tribe node can trigger a snapshot action on the source cluster 
> master, open a transactional connection from a node in the source cluster 
> to a node in the target cluster, and place a restore action on a queue on 
> the target cluster master, plus a rollback logic if shard transaction 
> fails. So in short, the ES cluster to cluster replication process could be 
> realized by a "primary shard replication protocol".
>
> Just my 2¢
>
> Jörg
>
>
> On Fri, Jan 23, 2015 at 7:42 PM, Todd Nine wrote:
>
>> Thanks for the pointers Jorg,
>>   We use Rx Java in our current application, so I'm familiar with 
>> backpressure and ensuring we don't overwhelm target systems.  I've been 
>> mulling over the high level design a bit more.  A common approach in all 
>> systems that perform multi region replication is the concept of "log 
>> shipping".  It's used heavily in SQL systems for replication, as well as in 
>> systems such as Megastore/HBase.  This seems like it would be the most 
>> efficient way to ship data from Region A to Region B with a reasonable 
>> amount of latency.  I was thinking something like the following.
>>
>> *Admin Operation Replication*
>>
>> This can get messy quickly.  I'm thinking I won't have any sort of 
>> "merge" logic since this can get very different for everyone's use case.  I 
>> was going to support broadcasting the following operations.
>>
>>
>>- Index creation
>>- Index deletion
>>- Index mapping updates
>>- Alias index addition
>>- Alias index removal
>>
>> This can also get tricky because it makes the assumption of unique index 
>> operations in each region.  Our indexes are Time UUID based, so I know we 
>> won't get conflicts.  I won't handle the case of an operation being 
>> replayed that conflicts with an existing index, I'll simply log it and drop 
>> it.  Handlers could be built in later so users could create their own 
>> resolution logic.  Also, this must be replayed in a very strict order.  I'm 
>> concerned that adding this additional master/master region communication 
>> could result in more load on the master.  This can be solved by running a 
>> dedicated master, but I don't really see any other solution.
>>
>>
>> *Data Replication*
>>
>> 1) Store last sent segments, probably in a system index.  Each region 
>> could be offline at different times, 

Re: Help creating a near real time streaming plugin to perform replication between clusters

2015-01-23 Thread Todd Nine
Thanks for the pointers Jorg,
  We use Rx Java in our current application, so I'm familiar with 
backpressure and ensuring we don't overwhelm target systems.  I've been 
mulling over the high level design a bit more.  A common approach in all 
systems that perform multi region replication is the concept of "log 
shipping".  It's used heavily in SQL systems for replication, as well as in 
systems such as Megastore/HBase.  This seems like it would be the most 
efficient way to ship data from Region A to Region B with a reasonable 
amount of latency.  I was thinking something like the following.

*Admin Operation Replication*

This can get messy quickly.  I'm thinking I won't have any sort of "merge" 
logic since this can get very different for everyone's use case.  I was 
going to support broadcasting the following operations.


   - Index creation
   - Index deletion
   - Index mapping updates
   - Alias index addition
   - Alias index removal

This can also get tricky because it makes the assumption of unique index 
operations in each region.  Our indexes are Time UUID based, so I know we 
won't get conflicts.  I won't handle the case of an operation being 
replayed that conflicts with an existing index, I'll simply log it and drop 
it.  Handlers could be built in later so users could create their own 
resolution logic.  Also, this must be replayed in a very strict order.  I'm 
concerned that adding this additional master/master region communication 
could result in more load on the master.  This can be solved by running a 
dedicated master, but I don't really see any other solution.


*Data Replication*

1) Store last sent segments, probably in a system index.  Each region could 
be offline at different times, so for each segment I'll need to know where 
it's been sent.

2) Monitor segments as they're created.  I still need to figure this out a 
bit more in the context of latent sending. 

Example.  Region us-east-1 ES nodes.

We missed sending 5 segments to us-west-1, and they were merged into 1.  I
now only need to send the 1 merged segment to us-west-1, since the other 5
segments will be removed.

However, when a merged segment is created in us-east-1 from 5 segments I've
already sent to us-west-1, I won't want to ship it, since it will already
contain the data.  As the tree is continually merged, I'll need to somehow
sort out what contains shipped data and what contains unshipped data.


3) As a new segment is created perform the following.
  3.a) Replay any administrative operations since the last sync on the 
index to the target region, so the state is current.
  3.b) Push the segment to the target region

4) The region receives the segment, and adds it to its current segments.
 When a segment merge happens in the receiving region, this will get merged 
in.





Thoughts?




On Thursday, January 15, 2015 at 5:29:10 PM UTC-7, Jörg Prante wrote:
>
> While it seems quite easy to attach listeners to an ES node to capture 
> operations in translog-style and push out index/delete operations on shard 
> level somehow, there will be more to consider for a reliable solution.
>
> The Couchbase developers have added a data replication protocol to their 
> product which is meant for transporting changes over long distances with 
> latency for in-memory processing.
>
> To learn about the most important features, see
>
> https://github.com/couchbaselabs/dcp-documentation
>
> and
>
> http://docs.couchbase.com/admin/admin/Concepts/dcp.html
>
> I think bringing such a concept of an inter cluster protocol into ES could 
> be a good starting point, to sketch the complete path for such an ambitious 
> project beforehand.
>
> Most challenging could be dealing with back pressure when receiving 
> nodes/clusters are becoming slow. For a solution to this, reactive Java / 
> reactive streams look like a viable possibility.
>
> See also
>
> https://github.com/ReactiveX/RxJava/wiki/Backpressure
>
> http://www.ratpack.io/manual/current/streams.html
>
> I'm in favor of Ratpack since it comes with Java 8, Groovy, Google Guava, 
> and Netty, which has a resemblance to ES.
>
> In ES, for inter cluster communication, there is not much coded afaik, 
> except snapshot/restore. Maybe snapshot/restore can provide everything you 
> want, with incremental mode. Lucene will offer numbered segment files for 
> faster incremental snapshot/restore.
>
> Just my 2¢
>
> Jörg
>
>
>
> On Thu, Jan 15, 2015 at 7:00 PM, Todd Nine wrote:
>
>> Hey all,
>>   I would like to create a plugin, and I need a hand.  Below are the 
>> requirements I have.
>>
>>
>>- Our documents are immutable.  They are only ever created or 
>>d

Help creating a near real time streaming plugin to perform replication between clusters

2015-01-15 Thread Todd Nine
Hey all,
  I would like to create a plugin, and I need a hand.  Below are the 
requirements I have.


   - Our documents are immutable.  They are only ever created or deleted, 
   updates do not apply.
   - We want mirrors of our ES cluster in multiple AWS regions.  This way 
   if the WAN between regions is severed for any reason, we do not suffer an 
   outage, just a delay in consistency.
   - As documents are added or removed they are rolled up then shipped in 
   batch to the other AWS Regions.  This can be as fast as a few milliseconds,
   or as slow as minutes, and will be user configurable.  Note that a full 
   backup+load is too slow, this is more of a near realtime operation.
   - This will sync the following operations.  
  - Index creation/deletion
  - Alias creation/deletion
  - Document creation/deletion
   

What I'm thinking architecturally.


   - The plugin is installed on each node in our cluster in all regions
   - The plugin will only gather changes for the primary shards on the 
   local node 
   - After the timeout elapses, the plugin will ship the changelog to the 
   other AWS regions, where the plugin will receive it and process it


Are there any APIs I can look at that are a good starting point for
developing this?  I'd like to do a simple prototype with 2 one-node clusters
reasonably soon.  I found several plugin tutorials, but I'm more concerned
with what part of the ES API I can call to receive events, if any.
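
Purely as an assumption to verify against the 1.4 source, the shape I imagine such
a hook taking is a per-shard indexing operation listener; the class name, package,
and method signatures below may well differ from what actually ships, so treat this
as a sketch of intent rather than a known API:

import org.elasticsearch.index.engine.Engine;
import org.elasticsearch.index.indexing.IndexingOperationListener;

// Assumed hook: a listener the shard-level indexing service notifies after each
// index/delete operation, so each node only sees changes to its local shards.
public class ChangeCaptureListener extends IndexingOperationListener {

    @Override
    public void postIndex(Engine.Index index) {
        // queue the document (type, id, source) for batched shipping to other regions
    }

    @Override
    public void postDelete(Engine.Delete delete) {
        // queue a tombstone (type, id) for batched shipping to other regions
    }
}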

Thanks,
Todd 



Re: Multi region replication help

2015-01-14 Thread Todd Nine
Hey Mark,
  We're looking for a solution that's faster than a snapshot+restore.  We
do follow an eventual consistency model, but we'd like the replication to
be near realtime, so a snapshot and restore would be too slow.  We really
need something that synchronizes between regions with some sort of log or
queue, so that if a region goes down, we can replay all missed operations.
  We were going to put an SQS queue in our application tier, but if this is
something an ES plugin can provide, we would prefer to utilize that over
writing our own.

I was searching the plugin API, but I can't seem to find anything that will
receive post-write/delete document events I can hook into.  Ideally, I
would like each node to queue up changes to primary shards.  This way, we
can distribute sending the changes among all our nodes and we won't have a
single point of failure.

Thanks,
Todd

On Wed, Jan 14, 2015 at 3:03 PM, Mark Walkom  wrote:

> You could use snapshot and restore, or even Logstash.
>
> On 15 January 2015 at 10:07, Todd Nine  wrote:
>
>> Hi all,
>>   We have a deployment scenario I can't seem to find any examples of, and
>> any help would be greatly appreciated.  We're running ElasticSearch in 3
>> AWS regions.  We want these regions to survive a failure from other
>> regions, and we want all writes and reads from our clients to occur in the
>> local region.  Rather than have 1 large cluster that spans 3 regions, I
>> would like the ability to asynchronously replicate document creation and
>> deletion across the regions.  Our application creates immutable documents,
>> so update semantics don't apply.  Is there any way this can be accomplished
>> currently?  Note that I don't think a river will provide us with the
>> partition we need since it is a singleton.
>>
>> Thanks!
>> Todd
>>


Multi region replication help

2015-01-14 Thread Todd Nine
Hi all,
  We have a deployment scenario I can't seem to find any examples of, and 
any help would be greatly appreciated.  We're running ElasticSearch in 3 
AWS regions.  We want these regions to survive a failure from other 
regions, and we want all writes and reads from our clients to occur in the 
local region.  Rather than have 1 large cluster that spans 3 regions, I 
would like the ability to asynchronously replicate document creation and 
deletion across the regions.  Our application creates immutable documents, 
so update semantics don't apply.  Is there any way this can be accomplished 
currently?  Note that I don't think a river will provide us with the 
partition we need since it is a singleton.   

Thanks!
Todd



Internal implementation details when using geo hash

2014-11-14 Thread Todd Nine
Hey All,
  I have a question about the internal implementation of geo hashes and
distance filters.  Here is my current understanding; I'm struggling to
figure out how to apply these to our queries internally in ES.

Using bool queries is very efficient.  Internally they
perform bitmap union, intersection, and subtraction for very fast candidate
aggregation per term.

Geo distance filters are then run on the results of the candidates from the
bitmap logic.  Each document must be evaluated individually in memory.
Obviously for large document sets from the bitmap evaluation, this is
inefficient.



What happens when someone only gives our application a geo distance query?
To make this more efficient, I would like to use geo hashing.  ES seems to
have geo hashing built in, but it's documented as a filter.  For instance, I
envision the following workflow internally in ES (a code sketch follows the
list below).


1) User searches for all matches within 2k of their current location
2) Use a geohash to create a hash that will encapsulate all points within 
2k of their current location
3) Use the bool query with this geo hash to narrow the candidate result set
4) Apply the distance filter to these candidates to get more accurate 
results.
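
The same narrowing can be expressed on the client side today by combining the
geohash_cell filter with the geo_distance filter; a sketch is below, assuming a
geo_point field named "location" that is mapped with geohash prefixes enabled, and
using match_all in place of the real term query (the GeohashCell builder method
names should be double-checked against the 1.4 Java client):

import org.elasticsearch.index.query.FilterBuilder;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class GeoNarrowSketch {
    // Coarse pass: geohash_cell behaves like a term match on the indexed geohash
    // prefix.  Exact pass: geo_distance checks the surviving candidates in memory.
    static QueryBuilder within2km(double lat, double lon) {
        FilterBuilder coarse = FilterBuilders.geoHashCellFilter("location")
                .point(lat, lon)
                .precision("2km")
                .neighbors(true);
        FilterBuilder exact = FilterBuilders.geoDistanceFilter("location")
                .point(lat, lon)
                .distance("2km");
        return QueryBuilders.filteredQuery(
                QueryBuilders.matchAllQuery(),
                FilterBuilders.boolFilter().must(coarse).must(exact));
    }
}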



However, when reading the documentation on searching geo hashing, it's 
still a filter.  Internally, does it use geohasing and the fast bitmaps 
since it's a string match, then filter, or is it all filters and the hash 
is evaluated in memory for all documents?

http://www.elasticsearch.org/guide/en/elasticsearch/reference/1.4/query-dsl-geohash-cell-filter.html



Thanks,
Todd





Refresh on 1.4.0 not working as expected

2014-11-14 Thread Todd Nine
Hey guys,
  We're testing ES 1.4.0, coming from 1.3.2.  I'm noticing some strange behavior in
our clients in our integration tests.  They perform the following logic (a condensed
code sketch follows the list).

Create the first index in the cluster (single node) with a custom
__default__ dynamic mapping

Add 3 documents, each of a new type (egg, muffin, oj), to the index

Perform a refresh on the new index

Perform a match-all query with the types egg, muffin, and oj.  I receive
no responses

Perform the same query again.  This time I receive responses, and I receive all
documents consistently after the refresh.
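
Condensed, the test sequence is roughly the following (a sketch with placeholder
document bodies, not the actual test code):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;

public class RefreshReproSketch {
    static SearchResponse reproduce(Client client, String index, String defaultMapping) {
        client.admin().indices().prepareCreate(index)
                .addMapping("_default_", defaultMapping)          // our custom default mapping
                .execute().actionGet();

        client.prepareIndex(index, "egg").setSource("{\"name\":\"fried\"}").execute().actionGet();
        client.prepareIndex(index, "muffin").setSource("{\"name\":\"blueberry\"}").execute().actionGet();
        client.prepareIndex(index, "oj").setSource("{\"name\":\"pulp free\"}").execute().actionGet();

        client.admin().indices().prepareRefresh(index).execute().actionGet();

        // Expected: 3 hits.  Observed on 1.4.0 (first index + default mapping only):
        // 0 hits on the first call, 3 hits on the second.
        return client.prepareSearch(index)
                .setTypes("egg", "muffin", "oj")
                .setQuery(QueryBuilders.matchAllQuery())
                .execute().actionGet();
    }
}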


This only seems to be an issue when the following is true.  On subsequent 
tests, I don't see this behavior.

1) Its the first time refresh is called on the first index created in the 
system
2) The index uses default type mappings


If I roll back to 1.3.2, I don't see this.  I haven't tried 1.3.5.  Has 
something changed internally on index refresh?  Note that we use the 
refresh operation sparingly and mostly in rare admin functions of our 
system, or in integration testing.  We rarely use this in production.

Thanks,
Todd




Re: NPE from server when using query + geo filter + sort

2014-11-11 Thread Todd Nine
You're right, Jörg.  There was an issue where a type was incorrectly set to
null when more than 1 was specified.  As a result, the request passed the check
for at least 1 element in the array, even though the type itself in element
0 was null.  Thank you for your help!
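
For anyone hitting the same stack trace: the null ends up in the types array handed
to setTypes, so a defensive check along these lines (illustrative only, not the
actual Usergrid fix) avoids serializing a null string:

import java.util.ArrayList;
import java.util.List;

import org.elasticsearch.action.search.SearchRequestBuilder;

public class TypesGuardSketch {
    // Drop null/empty entries before handing the types to the request builder;
    // a null element survives the "at least one type" length check but blows up
    // later when the request is serialized to the transport layer.
    static void setTypesSafely(SearchRequestBuilder builder, String... types) {
        List<String> clean = new ArrayList<String>();
        if (types != null) {
            for (String type : types) {
                if (type != null && !type.isEmpty()) {
                    clean.add(type);
                }
            }
        }
        if (!clean.isEmpty()) {
            builder.setTypes(clean.toArray(new String[clean.size()]));
        }
    }
}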



On Tuesday, November 11, 2014 12:48:22 PM UTC-7, Jörg Prante wrote:
>
> Looks like a bug in org.apache.usergrid.persistence.index.impl.
> EsEntityIndexImpl
>
> Check if the "types" are set to a non-null value in the SearchRequest. If 
> you force them to be a null value, SearchRequest will throw the NPE you 
> posted.
>
> Jörg
>
> On Tue, Nov 11, 2014 at 6:11 PM, Todd Nine wrote:
>
>> Hi all,
>>   I'm getting some strange behavior from the ES server when using a term 
>> query + a geo distance filter + a sort.   I've tried this with 1.3.2, 
>> 1.3.5, as well as 1.4.0.  All exhibit this same behavior.  I'm using the 
>> Java transport client.  Here is my SearchRequestBuilder payload in 
>> toString() format.
>>
>>
>> query {   "from" : 0,   "size" : 10,   "query" : { "term" : {   
>> "ug_context" : 
>> "2abaf75a-69c4-11e4-918e-4658cde16dad__application__zzzcollzzzusers" } 
>>   },   "post_filter" : { "geo_distance" : {   "go_location" : [ 
>> -122.40474700927734, 37.77427673339844 ],   "distance" : "200.0m" } 
>>   },   "sort" : [ { "su_created" : {   "order" : "asc",   
>> "ignore_unmapped" : true }   }, { "nu_created" : {   "order" : 
>> "asc",   "ignore_unmapped" : true }   }, { "bu_created" : { 
>>   "order" : "asc",   "ignore_unmapped" : true }   } ] } limit 10
>>
>>
>> Here is the stack trace
>>
>> org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: 
>> Failed execution
>> at 
>> org.elasticsearch.action.support.AdapterActionFuture.rethrowExecutionException(AdapterActionFuture.java:90)
>> at 
>> org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:50)
>> at 
>> org.apache.usergrid.persistence.index.impl.EsEntityIndexImpl.search(EsEntityIndexImpl.java:307)
>> at 
>> org.apache.usergrid.corepersistence.CpRelationManager.searchConnectedEntities(CpRelationManager.java:1453)
>> at 
>> org.apache.usergrid.corepersistence.CpEntityManager.searchConnectedEntities(CpEntityManager.java:1585)
>> at org.apache.usergrid.persistence.GeoIT.testGeo(GeoIT.java:202)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at 
>> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>> at 
>> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>> at 
>> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>> at 
>> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>> at 
>> org.apache.usergrid.CoreApplication$1.evaluate(CoreApplication.java:133)
>> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
>> at 
>> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
>> at 
>> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
>> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
>> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
>> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
>> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
>> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
>> at org.apache.usergrid.CoreITSetupImpl$1.evaluate(CoreITSetupImpl.java:66)
>> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>> at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
>> at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
>> at 
>> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
>> at 
>> com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(

Re: NPE from server when using query + geo filter + sort

2014-11-11 Thread Todd Nine
I just noticed I had a typo in my query, this is the query payload I'm 
executing if I run it in HTTP (which works)

{
  "from" : 0,
  "size" : 10,
  "query" : {
    "term" : {
      "ug_context" : "c2d2d78a-69cc-11e4-b22e-81db7b9aa660__user__zzzconnzzzlikes"
    }
  },
  "post_filter" : {
    "geo_distance" : {
      "go_location" : [ -122.40474700927734, 37.77427673339844 ],
      "distance" : "200.0m"
    }
  },
  "sort" : [ {
    "su_created" : {
      "order" : "asc",
      "ignore_unmapped" : true
    }
  }, {
    "nu_created" : {
      "order" : "asc",
      "ignore_unmapped" : true
    }
  }, {
    "bu_created" : {
      "order" : "asc",
      "ignore_unmapped" : true
    }
  } ]
}


On Tuesday, November 11, 2014 10:11:46 AM UTC-7, Todd Nine wrote:
>
> Hi all,
>   I'm getting some strange behavior from the ES server when using a term 
> query + a geo distance filter + a sort.   I've tried this with 1.3.2, 
> 1.3.5, as well as 1.4.0.  All exhibit this same behavior.  I'm using the 
> Java transport client.  Here is my SearchRequestBuilder payload in 
> toString() format.
>
>
> query {   "from" : 0,   "size" : 10,   "query" : { "term" : {   
> "ug_context" : 
> "2abaf75a-69c4-11e4-918e-4658cde16dad__application__zzzcollzzzusers" } 
>   },   "post_filter" : { "geo_distance" : {   "go_location" : [ 
> -122.40474700927734, 37.77427673339844 ],   "distance" : "200.0m" } 
>   },   "sort" : [ { "su_created" : {   "order" : "asc",   
> "ignore_unmapped" : true }   }, { "nu_created" : {   "order" : 
> "asc",   "ignore_unmapped" : true }   }, { "bu_created" : { 
>   "order" : "asc",   "ignore_unmapped" : true }   } ] } limit 10
>
>
> Here is the stack trace
>
> org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: 
> Failed execution
> at 
> org.elasticsearch.action.support.AdapterActionFuture.rethrowExecutionException(AdapterActionFuture.java:90)
> at 
> org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:50)
> at 
> org.apache.usergrid.persistence.index.impl.EsEntityIndexImpl.search(EsEntityIndexImpl.java:307)
> at 
> org.apache.usergrid.corepersistence.CpRelationManager.searchConnectedEntities(CpRelationManager.java:1453)
> at 
> org.apache.usergrid.corepersistence.CpEntityManager.searchConnectedEntities(CpEntityManager.java:1585)
> at org.apache.usergrid.persistence.GeoIT.testGeo(GeoIT.java:202)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at org.apache.usergrid.CoreApplication$1.evaluate(CoreApplication.java:133)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
> at org.junit.runners.ParentRunner$2.evaluate(ParentR

NPE from server when using query + geo filter + sort

2014-11-11 Thread Todd Nine
Hi all,
  I'm getting some strange behavior from the ES server when using a term 
query + a geo distance filter + a sort.   I've tried this with 1.3.2, 
1.3.5, as well as 1.4.0.  All exhibit this same behavior.  I'm using the 
Java transport client.  Here is my SearchRequestBuilder payload in 
toString() format.


query {   "from" : 0,   "size" : 10,   "query" : { "term" : {   
"ug_context" : 
"2abaf75a-69c4-11e4-918e-4658cde16dad__application__zzzcollzzzusers" } 
  },   "post_filter" : { "geo_distance" : {   "go_location" : [ 
-122.40474700927734, 37.77427673339844 ],   "distance" : "200.0m" } 
  },   "sort" : [ { "su_created" : {   "order" : "asc",   
"ignore_unmapped" : true }   }, { "nu_created" : {   "order" : 
"asc",   "ignore_unmapped" : true }   }, { "bu_created" : { 
  "order" : "asc",   "ignore_unmapped" : true }   } ] } limit 10


Here is the stack trace

org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: 
Failed execution
at 
org.elasticsearch.action.support.AdapterActionFuture.rethrowExecutionException(AdapterActionFuture.java:90)
at 
org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:50)
at 
org.apache.usergrid.persistence.index.impl.EsEntityIndexImpl.search(EsEntityIndexImpl.java:307)
at 
org.apache.usergrid.corepersistence.CpRelationManager.searchConnectedEntities(CpRelationManager.java:1453)
at 
org.apache.usergrid.corepersistence.CpEntityManager.searchConnectedEntities(CpEntityManager.java:1585)
at org.apache.usergrid.persistence.GeoIT.testGeo(GeoIT.java:202)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.apache.usergrid.CoreApplication$1.evaluate(CoreApplication.java:133)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.apache.usergrid.CoreITSetupImpl$1.evaluate(CoreITSetupImpl.java:66)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
at 
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
at 
com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
Caused by: java.lang.NullPointerException
at 
org.elasticsearch.common.io.stream.StreamOutput.writeString(StreamOutput.java:223)
at 
org.elasticsearch.common.io.stream.HandlesStreamOutput.writeString(HandlesStreamOutput.java:55)
at 
org.elasticsearch.common.io.stream.StreamOutput.writeStringArray(StreamOutput.java:298)
at 
org.elasticsearch.action.search.SearchRequest.writeTo(SearchRequest.java:615)
at 
org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:685)
at 
org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:199)
at 
org.elasticsearch.action.TransportActionNodeProxy.execute(TransportActionNodeProxy.java:57)
at 
org.elasticsearch.client.transport.support.InternalTransportClient$1.doWithNode(InternalTransportClient.java:109)
at 
org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:202)
at 
org.elasticsearch.client.transport.support.InternalTransportClient.execute(InternalTransportClient.java:106)
at 
org.elasticsearch.client.support.AbstractClient.search(AbstractClient.java:334)
at 
org.elasticsearch.client.transport.TransportClient.search(TransportClient.java:424)
at 
org.elasticsearch.action.search.SearchRequestBuilder.doExecute(SearchRequestBuilder.java:1116)
at 
org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:91)
at 
org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:65)

I've observed a couple of things.  I don't 

Help with a larger cluster in EC2 and Node clients failing to join

2014-10-30 Thread Todd Nine
Hey guys,
  We're running some load tests, and we're finding some of our clients are 
having issues joining the cluster.  We have the following setup running in 
EC2.


24 c3.4xlarge ES instances.  These instances are our data storage and 
search instances.
40 c3.xlarge Tomcat instances.  These are our ephemeral webapp instances.



In our Tomcat instances, we're using the Node Java client.  What I'm finding 
is that as the cluster comes up, only about 36-38 of the Tomcat nodes ever 
discover the 24 nodes that are actually storing data.  We're using TCP 
discovery, and have the following code in our client.


Settings settings = ImmutableSettings.settingsBuilder()
        .put( "cluster.name", fig.getClusterName() )

        // this assumes that we're using zen for host discovery.  Putting an
        // explicit set of bootstrap hosts ensures we connect to a valid cluster.
        .put( "discovery.zen.ping.unicast.hosts", allHosts )
        .put( "discovery.zen.ping.multicast.enabled", "false" )
        .put( "http.enabled", false )

        .put( "client.transport.ping_timeout", 2000 ) // milliseconds
        .put( "client.transport.nodes_sampler_interval", 100 )
        .put( "network.tcp.blocking", true )
        .put( "node.client", true )
        .put( "node.name", nodeName )
        .build();

log.debug( "Creating ElasticSearch client with settings: " + settings.getAsMap() );

where allHosts = "ip1:port, ip2:port", etc.
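
For completeness, the client itself is built from those settings along these 
lines (a sketch of the standard 1.x NodeBuilder path):

import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;

Node node = NodeBuilder.nodeBuilder()
    .settings( settings )
    .client( true )     // data-less client node, matching node.client=true above
    .node();            // starts the node and joins the cluster via the unicast hosts
Client client = node.client();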


Not sure what we're doing wrong.  We have retry code so that if a client 
fails 20 consecutive times with a TransportException, the node instance is 
stopped, and a new instance is created.  This has worked for dealing with 
network hiccups in the past; however, we never seem to get past the 
38-client mark.  Any ideas what we may be doing wrong?
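
For context, the retry wrapper is roughly this shape (a simplified sketch; the 
names here are hypothetical):

import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;
import org.elasticsearch.transport.TransportException;

int consecutiveFailures = 0;
while ( running ) {                      // hypothetical flag around our client work
    try {
        doWork( client );                // hypothetical: whatever index/search call is in flight
        consecutiveFailures = 0;
    }
    catch ( TransportException e ) {
        if ( ++consecutiveFailures >= 20 ) {
            node.close();                // stop the failing node instance
            node = NodeBuilder.nodeBuilder().settings( settings ).client( true ).node();
            client = node.client();
            consecutiveFailures = 0;
        }
    }
}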

Thanks,
Todd

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/55924250-3013-4f70-b3d2-b5381f8af993%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Help with profiling our code's usage of the Node java client using YourKit

2014-10-08 Thread Todd Nine
Hi All,
  I've been using ES for a while and I'm really enjoying it, but we have a 
few slow calls in our code.  I have a few hunches around the code that's 
using the client inefficiently, but I would like some definitive proof. 
 I've attempted to profile our application using YourKit when we're load 
testing, and I've found we're spending 30-35% of our "wall time" in the 
Netty classes of the ES Node client.  However, I can't tell where the call 
is originating from in our source, since the calls are asynchronous and I 
can't capture the stack.   Are there any debugging options I can turn on in 
the client that will help me with profiling so that I can track down the 
code in our system that's not using our client efficiently?   Any help 
would be greatly appreciated!

Thanks,
Todd

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/332d8e12-f045-4ff7-99aa-889490002b86%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Help with designing our document for graphs. Indexing single nodes in graph with thousands of incoming edges

2014-10-07 Thread Todd Nine
Hey Jorg,
  Thanks for your response, it's very helpful.  We're taking a similar
approach now, however it is slightly different.  We're using Cassandra as
our edge storage, and we won't be moving to Elastic Search for a bit.
We're very operationally familiar with Cassandra, and Elastic Search is a
new beast to us.  We're only going to use it for secondary indexing until
we become more comfortable with it, then we're going to transition more use
cases to it.


In our current implementation, entities contained within types of edges are
indexed by target type and edge type into Elasticsearch.  In the example I
gave above, we actually index 3 documents, all of which are the same except
for the type, to make seeks more efficient.


{
  docId: "e1-v1",
  name: "duo",
  openTime: 9,
  closeTime: 18,
  _type: "george/likes"
}

{
  docId: "e1-v1",
  name: "duo",
  openTime: 9,
  closeTime: 18,
  _type: "dave/likes"
}

{
  docId: "e1-v1",
  name: "duo",
  openTime: 9,
  closeTime: 18,
  _type: "rod/likes"
}


We then search within the type of "dave/likes" for all restaurants dave
likes.  We end up with a lot more documents this way, but we won't hit the
document size issues we've discussed.  What sort of recommendations do you
feel we should have for shard size?  Right now we're just sticking with the
default 10, with a replica count of 2, and we seem to be doing well.
Ultimately we're going to change our client to point to an alias, and add
more indexes behind the alias as we expand.  Most apps will never need more
than 10 shards; a handful will need to expand into a few indexes.
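
As a rough sketch (the index name is made up), the per-edge-type lookup with
the Java client ends up something like:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.index.query.QueryBuilders;

SearchResponse resp = client.prepareSearch( "graph" )    // placeholder index name
    .setTypes( "dave/likes" )                            // only docs indexed under dave's edge type
    .setQuery( QueryBuilders.matchAllQuery() )
    .setSize( 10 )
    .execute()
    .actionGet();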

Thoughts on this implementation?

Thanks,
Todd

On Tue, Oct 7, 2014 at 5:19 AM, joergpra...@gmail.com  wrote:

> Unfortunately, adding edges per update script will soon become expensive.
> Note, updating is a multistep process of reading the doc, looking up the
> field (often by fetching _source), and reindexing the whole(!) document
> (not only the new edge) plus the versioning conflict management in case you
> run concurrent updates. Also, this is the same procedure for removing an
> edge. This is a huge difference to graph algorithms, where it is very cheap
> to add/remove edges. Script updates will work to a certain extent quite
> satisfactory, but you are right, if you want to add millions of edges to an
> ES doc one by one, this will not be efficient.
>
> So I would like to suggest to avoid the overhead of updating fields by
> script in preference to add / remove relations by their "relation id", i.e.
> to treat relations as first citizen docs. Adding millions of docs to an ES
> index is cheaper than a million scripted updates on a single field.
>
> Jörg
>
>
> On Tue, Oct 7, 2014 at 1:23 AM, Todd Nine  wrote:
>
>> Hi Jorg,
>>   Thanks for the response.  I don't actually need to model the
>> relationship per se, more that a document is used in a relationship via a
>> filter, then search on it's properties.  See the example below for more
>> clarity.
>>
>>
>> Restaurant: => {name: "duo"}
>>
>> Now, lets say I have 3 users,
>>
>> George, Dave and Rod
>>
>> George Dave and Rod all "like" the restaurant Duo.  These are directed
>> edges from the user, of type "likes" to the "duo" document.  We store these
>> edges in Cassandra.  Envision the document looking something like this.
>>
>>
>> {
>> name: "duo",
>> openTime: 9,
>> closeTime: 18
>> _in_edges: [ "george/likes", "dave/likes", "rod/likes" ]
>> }
>>
>> Then when searching, the user Dave would search something like this.
>>
>> select * where closeTime < 16
>>
>>
>> Which we translate in to a query, which is then also filtered by
>> _in_edges = "dave/likes".
>>
>> Our goal is to only create 1 document per node in our graph (in this
>> example restaurant), then possibly use the scripting API to add and remove
>> elements to the _in_edges fields and update the document.  My only concern
>> around this is document size.  It's not clear to me how to go about this
>> when we start getting millions of edges to that same target node, or
>> _in_edges field could grow to be millions of fields long.  At that point,
>>  is it more efficient to de-normalize and just turn "dave/likes",
>> "rod/likes", and "george/likes" into document types and store multiple
>> copies?
>>
>> Thanks,
>> Todd
>>
>>
>>
>>
>>
>>
>>
>>
>> On Sat, Oct 4, 2014 at 2:52 AM, joergpra...@gmail.com <

Re: Help with designing our document for graphs. Indexing single nodes in graph with thousands of incoming edges

2014-10-06 Thread Todd Nine
Hi Jorg,
  Thanks for the response.  I don't actually need to model the relationship
per se; rather, a document is referenced in a relationship via a filter and
then searched on its properties.  See the example below for more clarity.


Restaurant: => {name: "duo"}

Now, let's say I have 3 users:

George, Dave, and Rod

George, Dave, and Rod all "like" the restaurant Duo.  These are directed
edges from the user, of type "likes", to the "duo" document.  We store these
edges in Cassandra.  Envision the document looking something like this.


{
  name: "duo",
  openTime: 9,
  closeTime: 18,
  _in_edges: [ "george/likes", "dave/likes", "rod/likes" ]
}

Then when searching, the user Dave would search something like this.

select * where closeTime < 16


Which we translate into a query, which is then also filtered by
_in_edges = "dave/likes".

Our goal is to create only 1 document per node in our graph (in this
example, the restaurant), then possibly use the scripting API to add and
remove elements of the _in_edges field and update the document.  My only
concern with this is document size.  It's not clear to me how to go about
this when we start getting millions of edges to the same target node, since
the _in_edges field could grow to be millions of entries long.  At that
point, is it more efficient to de-normalize and just turn "dave/likes",
"rod/likes", and "george/likes" into document types and store multiple
copies?
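
And the scripted append would look roughly like the sketch below (the
index/type/id are placeholders, and the exact update-script method names vary
a bit across 1.x releases, so treat this as approximate):

import org.elasticsearch.script.ScriptService;

client.prepareUpdate( "graph", "restaurant", "e1-v1" )             // placeholder index/type/id
    .setScript( "ctx._source._in_edges += edge", ScriptService.ScriptType.INLINE )
    .addScriptParam( "edge", "dave/likes" )                        // the edge being appended
    .execute()
    .actionGet();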

Thanks,
Todd








On Sat, Oct 4, 2014 at 2:52 AM, joergpra...@gmail.com  wrote:

> Not sure if this helps but I use a variant of graphs in ES, it is called
> Linked Data (JSON-LD)
>
> By using JSON-LD, you can index something like
>
> doc index: graph
> doc type: relations
> doc id: ...
>
> {
>"user" : {
>   "id" : "...",
>   "label" : "Bob",
>   "likes" : "restaurant:Duo"
>   }
> }
>
> for the statement "Bob likes restaurant Duo"
>
> and then you can run ES queries on the field "likes" or better
> "user.likes" for finding the users that like a restaurant etc. Referencing
> the "id" it is possible to lookup another document in another index about
> "Bob".
>
> Just to give an idea how you can model relations in structured ES JSON
> objects.
>
> Jörg
>
>
> On Fri, Oct 3, 2014 at 7:59 PM, Todd Nine  wrote:
>
>> So clearly I need to RTFM.  I missed this in the documentation the first
>> time.
>>
>>
>> http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/mapping.html#_how_types_are_implemented
>>
>> Will filters at this scale be fast enough?
>>
>>
>>
>> On Friday, October 3, 2014 11:48:40 AM UTC-6, Todd Nine wrote:
>>>
>>> Hey guys,
>>>   We're currently storing entities and edges in Cassandra.  The entities
>>> are JSON, and edges are directed edges with a source---type-->target.
>>> We're using ElasticSearch for indexing and I could really use a hand with
>>> design.
>>>
>>> What we're doing currently, is we take an entity, and turn it's JSON
>>> into a document.  We then create multiple copies of our document and change
>>> it's type to match the index.  For instance, Image the following use case.
>>>
>>>
>>> bob(user) -- likes -- > Duo (restaurant)   ===> Document Type  =
>>> bob(user) + likes + restaurant ; bob(user) + likes
>>>
>>>
>>> bob(user) -- likes -> Root Down (restaurant)  ===> Document Type  =
>>> bob(user) + likes+ restaurant ; bob(user) + likes
>>>
>>> bob(user) -- likes --> Coconut Porter (beer). ===> Document Types =
>>> bob(user) + likes + beer; bob(user) + likes
>>>
>>>
>>> When we index using this scheme we create 3 documents based on the
>>> restaurants Duo and Root Down, and the beer Coconut Porter.  We then store
>>> this document 2x, one for it's specific type, and one in the "all" bucket.
>>>
>>> Essentially, the document becomes a node in the graph.  For each
>>> incoming directed edge, we're storing 2x documents and changing the type.
>>> This gives us fast seeks when we search by type, but a LOT of data bloat.
>>> Would it instead be more efficient to keep an array of incoming edges in
>>> the document, then add it to our search terms?  For instance, should we
>>> instead have a document like this?
>>>
>>>
>>> docId: Duo(restaurant)
>>>
>>

Re: Help with designing our document for graphs. Indexing single nodes in graph with thousands of incoming edges

2014-10-03 Thread Todd Nine
So clearly I need to RTFM.  I missed this in the documentation the first 
time.

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/mapping.html#_how_types_are_implemented

Will filters at this scale be fast enough?



On Friday, October 3, 2014 11:48:40 AM UTC-6, Todd Nine wrote:
>
> Hey guys,
>   We're currently storing entities and edges in Cassandra.  The entities 
> are JSON, and edges are directed edges with a source---type-->target. 
>  We're using ElasticSearch for indexing and I could really use a hand with 
> design.
>
> What we're doing currently, is we take an entity, and turn it's JSON into 
> a document.  We then create multiple copies of our document and change it's 
> type to match the index.  For instance, Image the following use case.
>
>
> bob(user) -- likes -- > Duo (restaurant)   ===> Document Type  = bob(user) 
> + likes + restaurant ; bob(user) + likes
>  
>
> bob(user) -- likes -> Root Down (restaurant)  ===> Document Type  = 
> bob(user) + likes+ restaurant ; bob(user) + likes
>
> bob(user) -- likes --> Coconut Porter (beer). ===> Document Types = 
> bob(user) + likes + beer; bob(user) + likes
>
>
> When we index using this scheme we create 3 documents based on the 
> restaurants Duo and Root Down, and the beer Coconut Porter.  We then store 
> this document 2x, one for it's specific type, and one in the "all" bucket.  
>
> Essentially, the document becomes a node in the graph.  For each incoming 
> directed edge, we're storing 2x documents and changing the type.  This 
> gives us fast seeks when we search by type, but a LOT of data bloat.  Would 
> it instead be more efficient to keep an array of incoming edges in the 
> document, then add it to our search terms?  For instance, should we instead 
> have a document like this?
>
>
> docId: Duo(restaurant)
>
> edges: [ "bob(user) + likes + restaurant", "bob(user) + likes" ]
>
> When searching where edges = "bob(user) + likes + restaurant"?
>
>
> I don't know internally what specifying type actually does, if it just 
> treats it as as field, or if it changes the routing of the response?In 
> a social situation millions of people can be connected to any one entity, 
> so we have to have a scheme that won't fall over when we get to that case.
>
> Any help would be greatly appreciated!
>
> Thanks,
> Todd
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/f97c6475-f4fc-4078-b052-b497ac82dc91%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Help with designing our document for graphs. Indexing single nodes in graph with thousands of incoming edges

2014-10-03 Thread Todd Nine
Hey guys,
  We're currently storing entities and edges in Cassandra.  The entities 
are JSON, and edges are directed edges with a source---type-->target. 
 We're using ElasticSearch for indexing and I could really use a hand with 
design.

What we're doing currently is we take an entity and turn its JSON into a 
document.  We then create multiple copies of our document and change its 
type to match the index.  For instance, imagine the following use case.


bob(user) -- likes -- > Duo (restaurant)   ===> Document Type  = bob(user) 
+ likes + restaurant ; bob(user) + likes
 

bob(user) -- likes -> Root Down (restaurant)  ===> Document Type  = 
bob(user) + likes+ restaurant ; bob(user) + likes

bob(user) -- likes --> Coconut Porter (beer). ===> Document Types = 
bob(user) + likes + beer; bob(user) + likes


When we index using this scheme we create 3 documents based on the 
restaurants Duo and Root Down, and the beer Coconut Porter.  We then store 
each document 2x: one for its specific type, and one in the "all" bucket.  

Essentially, the document becomes a node in the graph.  For each incoming 
directed edge, we're storing 2x documents and changing the type.  This 
gives us fast seeks when we search by type, but a LOT of data bloat.  Would 
it instead be more efficient to keep an array of incoming edges in the 
document, then add it to our search terms?  For instance, should we instead 
have a document like this?


docId: Duo(restaurant)

edges: [ "bob(user) + likes + restaurant", "bob(user) + likes" ]

When searching where edges = "bob(user) + likes + restaurant"?


I don't know internally what specifying the type actually does, whether it 
just treats it as a field, or if it changes the routing of the response.  In 
a social situation, millions of people can be connected to any one entity, 
so we have to have a scheme that won't fall over when we get to that case.

Any help would be greatly appreciated!

Thanks,
Todd

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/84b745c3-686a-4e9e-a02a-2816f90a23a1%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Upper bounds on the number of indexes in an elastic search cluster

2014-09-26 Thread Todd Nine
It sounds like we're going to need to test our upper bounds of indexes
(with no data) to see how many we can support.  We may need to re-evaluate
our thoughts on an index per app.  We might be better off doing a
statically sized set of indexes, then consistently hashing our applications
to those indexes.  Thank you for your help!
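
Something like the following is what I have in mind for the static mapping
(just a sketch; with a fixed pool a stable hash plus a modulo is enough, and a
proper hash ring would only matter if the pool ever had to grow):

int poolSize = 32;                                       // fixed, pre-created set of indexes
int bucket = Math.abs( appId.hashCode() % poolSize );    // or a stronger hash such as murmur3
String indexName = "apps-" + bucket;                     // every app maps to exactly one index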

On Fri, Sep 26, 2014 at 5:44 PM, joergpra...@gmail.com <
joergpra...@gmail.com> wrote:

> If you consider tens of thousands of indices on tens of thousands of
> nodes, and the master node is the only node that can write to the cluster
> state, it will have lot of work to do to keep up with all cluster state
> updates.
>
> When the rate of changes to the cluster state increases, the master node
> will be challenged to propagate the state changes to all other nodes
> reliably and fast enough. There is no "distributed tree" yet, for example,
> there are no special "forwarder nodes" that communicate the cluster state
> to a partitioned set of nodes.
>
> See https://github.com/elasticsearch/elasticsearch/issues/6186
>
> The cluster state is a big compressed JSON structure which also must fit
> into the heap memory of master eligible nodes. ES also uses privileged
> network traffic channels for cluster state communication to take precedence
> over ordinary index/search messaging. But all these precautions may not be
> enough at some point. You can observe the point by retrieving a growing
> cluster state over the cluster state API and measure the size and time of
> this request.
>
> On the other hand, you can have calm cluster state and many thousand
> indices when the type mappings are constant, no field updates occur, and no
> nodes connect/disconnect. It all depends on the situation how you have to
> operate the cluster. One important thing is to allocate enough resources to
> master eligible data-less nodes so they are not hindered by extra
> search/index load.
>
> N.B. 20 nodes is not a big cluster. There are ES clusters of hundreds and
>  thousands of nodes. From my understanding, the communication of the master
> and 20 nodes is not a serious problem. This becomes an issue at ~500-1000
> nodes.
>
> Jörg
>
>
>
>
> On Sat, Sep 27, 2014 at 1:12 AM, Todd Nine  wrote:
>
>> Hi Jorg,
>>   We're storing each application in it's own Index so we can manage it
>> independently of others.  There's not set load or usage on our
>> applications.  Some will be very small, a few hundred documents.  Others
>> will be quite large, in the billions.  We have no way of knowing what the
>> usage profile is.  Rather, we're initially thinking that expansion will
>> occur using a combination of additional indexes and aliases referencing
>> those indexes.  This allows us to automate the management of the aliases
>> and indexes, and in turn allows us to scale them to the needs of the
>> application without over allocating unused capacity.   For instance, with
>> write heavy applications we can allocate more shards (via alias), for read
>> heavy application we can allocate more replicas.
>>
>>
>>
>> We're not running our cluster on a single node.  Our cluster is small to
>> begin with, it's 6 nodes in our current POC.  Ultimately I expect us to
>> grow each cluster we stand up to 20 or so nodes.  We'll expand as necessary
>> to support the number of shards and replicas and keep our performance up.
>> I'm not particularly worried about our ability to scale horizontally with
>> our hardware.
>>
>> Rather, I'm concerned with how far can we scale on our number of indexes,
>> and how does that relate to the number of machines?  When we keep adding
>> hardware, does this increase the upper bounds of the number of indexes we
>> can have?  Not the physical shards and replicas, but the routing
>> information for the master of the shards and location of the replicas.
>> I've done distributed data storage for many years, and none of the
>> documentation on ES makes it clear if this becomes an issue operationally.
>> I'm leery to just assume it will "just work".  When implementing something
>> like this, you either have to do a distributed tree for your meta data to
>> get the partitioning you need to scale infinitely, or every node must store
>> every shard's master information.   How does it work in ES?
>>
>>
>> Thanks,
>> Todd
>>
>>
>>
>> On Friday, September 26, 2014 4:29:53 PM UTC-6, Jörg Prante wrote:
>>>
>>> Why do you want to create huge number of indexes on just a single node?
>>>
>>> There are smarter method

Re: Upper bounds on the number of indexes in an elastic search cluster

2014-09-26 Thread Todd Nine
Hi Jorg,
  We're storing each application in its own index so we can manage it 
independently of others.  There's no set load or usage on our 
applications.  Some will be very small, a few hundred documents.  Others 
will be quite large, in the billions.  We have no way of knowing what the 
usage profile is.  Rather, we're initially thinking that expansion will 
occur using a combination of additional indexes and aliases referencing 
those indexes.  This allows us to automate the management of the aliases 
and indexes, and in turn allows us to scale them to the needs of the 
application without over allocating unused capacity.   For instance, with 
write heavy applications we can allocate more shards (via alias), for read 
heavy application we can allocate more replicas.



We're not running our cluster on a single node.  Our cluster is small to 
begin with, it's 6 nodes in our current POC.  Ultimately I expect us to 
grow each cluster we stand up to 20 or so nodes.  We'll expand as necessary 
to support the number of shards and replicas and keep our performance up. 
 I'm not particularly worried about our ability to scale horizontally with 
our hardware.  

Rather, I'm concerned with how far can we scale on our number of indexes, 
and how does that relate to the number of machines?  When we keep adding 
hardware, does this increase the upper bounds of the number of indexes we 
can have?  Not the physical shards and replicas, but the routing 
information for the master of the shards and location of the replicas. 
 I've done distributed data storage for many years, and none of the 
documentation on ES makes it clear if this becomes an issue operationally. 
 I'm leery to just assume it will "just work".  When implementing something 
like this, you either have to do a distributed tree for your meta data to 
get the partitioning you need to scale infinitely, or every node must store 
every shard's master information.   How does it work in ES?


Thanks,
Todd



On Friday, September 26, 2014 4:29:53 PM UTC-6, Jörg Prante wrote:
>
> Why do you want to create huge number of indexes on just a single node? 
>
> There are smarter methods to scale. Use over-allocation of shards. This is 
> explained by kimchy in this thread
>
>
> http://elasticsearch-users.115913.n3.nabble.com/Over-allocation-of-shards-td3673978.html
>  
> <http://www.google.com/url?q=http%3A%2F%2Felasticsearch-users.115913.n3.nabble.com%2FOver-allocation-of-shards-td3673978.html&sa=D&sntz=1&usg=AFQjCNEk7KTtpuEot3JtBBmMRMpH25vLDA>
>
> TL;DR you can create many thousands of aliases on a single (or few) 
> indices with just a few shards. There is no limit defined by ES, when your 
> configuration / hardware capacity is exceeded, you will see the node 
> getting sluggish.
>
> Jörg
>
> On Fri, Sep 26, 2014 at 11:23 PM, Todd Nine  > wrote:
>
>> Hey guys.  We’re building a Multi tenant application, where users create 
>> applications within our single server.For our current ES scheme, we're 
>> building an index per application.  Are there any stress tests or 
>> documentation on the upper bounds of the number of indexes a cluster can 
>> handle?  From my current understanding of meta data and routing, ever node 
>> caches the meta data of all the indexes and shards for routing.  At some 
>> point, this will obviously overwhelm the node.  Is my current understanding 
>> correct, or is this information partitioned across the cluster as well?
>>
>>
>> Thanks,
>>
>> Todd
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to elasticsearc...@googlegroups.com .
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/200e1ce7-c56f-49d4-9c02-4b1dcc570bf2%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/elasticsearch/200e1ce7-c56f-49d4-9c02-4b1dcc570bf2%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/595db4ab-b2f3-4dfe-bbf0-e4c13926e75e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Upper bounds on the number of indexes in an elastic search cluster

2014-09-26 Thread Todd Nine
 

Hey guys.  We’re building a multi-tenant application, where users create 
applications within our single server.  For our current ES scheme, we're 
building an index per application.  Are there any stress tests or 
documentation on the upper bounds of the number of indexes a cluster can 
handle?  From my current understanding of metadata and routing, every node 
caches the metadata of all the indexes and shards for routing.  At some 
point, this will obviously overwhelm the node.  Is my current understanding 
correct, or is this information partitioned across the cluster as well?


Thanks,

Todd

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/200e1ce7-c56f-49d4-9c02-4b1dcc570bf2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Shard count and plugin questions

2014-06-10 Thread Todd Nine
Hey Mark,
  Thanks for this.  It seems like our best bet will be to manage indexes
the same way across all regions, since they're really mirrors.  Since our
documents are immutable, we'll just queue them up for each region, which
will insert them into, or delete them from, its index in that region.  It's
the only solution I can think of, since we want to continue processing when
WAN connections go down.


On Tue, Jun 10, 2014 at 3:24 PM, Mark Walkom 
wrote:

> There are a few people in the IRC channel that have done it, however,
> generally, cross-WAN clusters are not recommended as ES is sensitive to
> latency.
>
> You may be better off using the snapshot/restore process, or another
> export/import method.
>
> Regards,
> Mark Walkom
>
> Infrastructure Engineer
> Campaign Monitor
> email: ma...@campaignmonitor.com
> web: www.campaignmonitor.com
>
>
> On 11 June 2014 03:11, Todd Nine  wrote:
>
>> Hey guys,
>>   One last question.  Does anyone do multi region replication with ES?
>>  My current understanding is that with a multi region cluster, documents
>> will be routed to the Region with a node that "owns" the shard the document
>> is being written to.  In our use cases, our cluster must survive a WAN
>> outage.  We don't want the latency of the writes or reads crossing the WAN
>> connection.  Our documents are immutable, so we can work with multi region
>> writes.  We simply need to replicate the write to other regions, as well as
>> the deletes.  Are there any examples or implementations of this?
>>
>> Thanks,
>> Todd
>>
>> On Thursday, June 5, 2014 4:11:44 PM UTC-6, Jörg Prante wrote:
>>>
>>> Yes, routing is very powerful. The general use case is to introduce a
>>> mapping to a large number of shards so you can store parts of data all at
>>> the same shard which is good for locality concepts. For example, combined
>>> with index alias working on filter terms, you can create one big concrete
>>> index, and use segments of it, so many thousands of users can share a
>>> single index.
>>>
>>> Another use case for routing might be a time windowed index where each
>>> shard holds a time window. There are many examples around logstash.
>>>
>>> The combination of index alias and routing is also known as "shard
>>> overallocation". The concept might look complex first but the other option
>>> would be to create a concrete index for every user which might also be a
>>> waste of resources.
>>>
> Though some here on this list have managed to run 1000s of shards on a 
>>> single node I still find this breathtaking - a few dozens of shards per
>>> node should be ok. Each shard takes some MB on the heap (there are tricks
>>> to reduce this a bit) but a high number of shards takes a handful of
>>> resources even without executing a single query. There might be other
>>> factors worth considering, for example a size limit for a single shard. It
>>> can be quite handy to let ES having move around shards of 1-10 GB instead
>>> of a few 100 GB - it is faster at index recovery or at reallocation time.
>>>
>>> Jörg
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Jun 5, 2014 at 9:44 PM, Todd Nine  wrote:
>>>
>>>> Hey Jorg,
>>>>   Thanks for the reply.  We're using Cassandra heavily in production,
>>>> I'm very familiar with the scale out out concepts.  What we've seen in all
>>>> our distributed systems is that at some point, you reach a saturation of
>>>> your capacity for a single node.  In the case of ES, to me that would seem
>>>> to be shard count.  Eventually, all 5 shards can become too large for a
>>>> node to handle updates and reads efficiently.  This can be caused by a high
>>>> number of documents or document size, or both.  Once we reach this state,
>>>> that index is "full" in the sense that the nodes containing these can no
>>>> longer continue to service traffic at the rate we need it to.  We have 2
>>>> options.
>>>>
>>>> 1) Get bigger hardware.  We do this occasionally, but not ideal since
>>>> this is a distributed system.
>>>>
>>>> 2) Scale out, as you said.  In the case of write throughput it seems
>>>> that we can do this with a pattern of alias + new index, but it's not clear
>>>> to me if that's the right approach.  My initial thinking is to define  some
>>>> sort of routing that

Re: Shard count and plugin questions

2014-06-10 Thread Todd Nine
Hey guys,
  One last question.  Does anyone do multi region replication with ES?  My 
current understanding is that with a multi region cluster, documents will 
be routed to the Region with a node that "owns" the shard the document is 
being written to.  In our use cases, our cluster must survive a WAN outage. 
 We don't want the latency of the writes or reads crossing the WAN 
connection.  Our documents are immutable, so we can work with multi region 
writes.  We simply need to replicate the write to other regions, as well as 
the deletes.  Are there any examples or implementations of this?

Thanks,
Todd

On Thursday, June 5, 2014 4:11:44 PM UTC-6, Jörg Prante wrote:
>
> Yes, routing is very powerful. The general use case is to introduce a 
> mapping to a large number of shards so you can store parts of data all at 
> the same shard which is good for locality concepts. For example, combined 
> with index alias working on filter terms, you can create one big concrete 
> index, and use segments of it, so many thousands of users can share a 
> single index.
>
> Another use case for routing might be a time windowed index where each 
> shard holds a time window. There are many examples around logstash.
>
> The combination of index alias and routing is also known as "shard 
> overallocation". The concept might look complex first but the other option 
> would be to create a concrete index for every user which might also be a 
> waste of resources.
>
> Though some here on this list have managed to run 1000s of shards on a 
> single node I still find this breathtaking - a few dozens of shards per 
> node should be ok. Each shard takes some MB on the heap (there are tricks 
> to reduce this a bit) but a high number of shards takes a handful of 
> resources even without executing a single query. There might be other 
> factors worth considering, for example a size limit for a single shard. It 
> can be quite handy to let ES having move around shards of 1-10 GB instead 
> of a few 100 GB - it is faster at index recovery or at reallocation time.
>
> Jörg
>
>
>
>
>
>
>
> On Thu, Jun 5, 2014 at 9:44 PM, Todd Nine > 
> wrote:
>
>> Hey Jorg,
>>   Thanks for the reply.  We're using Cassandra heavily in production, I'm 
>> very familiar with the scale out out concepts.  What we've seen in all our 
>> distributed systems is that at some point, you reach a saturation of your 
>> capacity for a single node.  In the case of ES, to me that would seem to be 
>> shard count.  Eventually, all 5 shards can become too large for a node to 
>> handle updates and reads efficiently.  This can be caused by a high number 
>> of documents or document size, or both.  Once we reach this state, that 
>> index is "full" in the sense that the nodes containing these can no longer 
>> continue to service traffic at the rate we need it to.  We have 2 options.  
>>
>> 1) Get bigger hardware.  We do this occasionally, but not ideal since 
>> this is a distributed system.
>>
>> 2) Scale out, as you said.  In the case of write throughput it seems that 
>> we can do this with a pattern of alias + new index, but it's not clear to 
>> me if that's the right approach.  My initial thinking is to define  some 
>> sort of routing that pivots on created date to the new index since that's 
>> an immutable field.  Thoughts?  
>>
>> In the case of read throughput, we an create more replicas.  Our systems 
>> is about 50/50 now, some users are even read/write, others are very read 
>> heavy.  I'll probably come up with 2 indexing strategies we can apply to an 
>> application's index based on the heuristics from the operations they're 
>> performing.
>>
>>
>> Thanks for the feedback!
>> Todd
>>
>>
>>
>>
>> On Thu, Jun 5, 2014 at 10:55 AM, joerg...@gmail.com  <
>> joerg...@gmail.com > wrote:
>>
>>> Thanks for raising the questions, I will come back later in more detail.
>>>
>>> Just a quick note, the idea about "shards scale write" and "replica 
>>> scale read" is correct, but Elasticsearch is also "elastic" which means it 
>>> "scales out", by adding node hardware. The shard/replica scale pattern 
>>> finds its limits in a node hardware, because shards/replica are tied to 
>>> machines, and there are the hard resource constraints, mostly disk I/O and 
>>> memory related. 
>>>
>>> In the end, you can take as a rule of thumb: 
>>>
>>> - add replica to scale "read" load
>>> - add new 

Re: Shard count and plugin questions

2014-06-05 Thread Todd Nine
Hey Jorg,
  Thanks for the reply.  We're using Cassandra heavily in production, I'm
very familiar with the scale out out concepts.  What we've seen in all our
distributed systems is that at some point, you reach a saturation of your
capacity for a single node.  In the case of ES, to me that would seem to be
shard count.  Eventually, all 5 shards can become too large for a node to
handle updates and reads efficiently.  This can be caused by a high number
of documents or document size, or both.  Once we reach this state, that
index is "full" in the sense that the nodes containing these can no longer
continue to service traffic at the rate we need it to.  We have 2 options.

1) Get bigger hardware.  We do this occasionally, but not ideal since this
is a distributed system.

2) Scale out, as you said.  In the case of write throughput it seems that
we can do this with a pattern of alias + new index, but it's not clear to
me if that's the right approach.  My initial thinking is to define some
sort of routing that pivots on created date to route documents to the new
index, since that's an immutable field.  Thoughts?
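
As a sketch of what I mean by pivoting on created date (the index naming is
hypothetical), writes would pick their index from the immutable created
timestamp, and reads would go through an alias spanning all of the period
indexes:

import java.text.SimpleDateFormat;
import java.util.Date;

SimpleDateFormat fmt = new SimpleDateFormat( "yyyy-MM" );
String indexName = "app1-" + fmt.format( new Date( createdTimestamp ) );  // e.g. app1-2014-06

client.prepareIndex( indexName, "entity", docId )    // docId and jsonSource are hypothetical
    .setSource( jsonSource )
    .execute()
    .actionGet();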

In the case of read throughput, we can create more replicas.  Our system is
about 50/50 now; some users are evenly read/write, others are very read
heavy.  I'll probably come up with 2 indexing strategies we can apply to an
application's index, based on heuristics from the operations they're
performing.


Thanks for the feedback!
Todd




On Thu, Jun 5, 2014 at 10:55 AM, joergpra...@gmail.com <
joergpra...@gmail.com> wrote:

> Thanks for raising the questions, I will come back later in more detail.
>
> Just a quick note, the idea about "shards scale write" and "replica scale
> read" is correct, but Elasticsearch is also "elastic" which means it
> "scales out", by adding node hardware. The shard/replica scale pattern
> finds its limits in a node hardware, because shards/replica are tied to
> machines, and there are the hard resource constraints, mostly disk I/O and
> memory related.
>
> In the end, you can take as a rule of thumb:
>
> - add replica to scale "read" load
> - add new indices (i.e. new shards) to scale "write" load
> - and add nodes to scale out the whole cluster for both read and write load
>
> More later,
>
> Jörg
>
>
>
>
>
> On Thu, Jun 5, 2014 at 7:17 PM, Todd Nine  wrote:
>
>> Hey Jörg,
>>   Thank you for your response.  A few questions/points.
>>
>> In our use cases, the inability to write or read is considered a
>> downtime.  Therefore, I cannot disable writes during expansion.  Your alias
>> points raise
>> some interesting research I need to do, and I have a few follow up
>> questions.
>>
>>
>>
>> Our systems are fully multi tenant.  We currently intend to have 1 index
>> per application.  Each application can have a large number of types within
>> their index. Each cluster could potentially hold 1000 or more
>> applications/indexes.  Most users will never need more than 5 shards.
>> Some users are huge power users, billions of documents for a type with
>> several large types.  These are the users I am concerned with.
>>
>> From my understanding and experimentation, Elastic Search has 2 primary
>> mechanisms for tuning performance to handle the load.  First is the shard
>> count.  The higher the shard count, the more writes you can accept.  Each
>> shard has a master which accepts the write, and replicates the write to
>> it's replicas.  For high write throughput, you increase the count of shards
>> to distribute the load across more nodes.  For read throughput, you
>> increase the replica count.  This gives you higher performance on read,
>> since you now have more than 1 node per shard you can query to get results.
>>
>>
>>
>> Per your suggestion, rather than than copy the documents from an index
>> with 5 shards to an index with 10 shards, I can theoretically create a new
>> index then add it the alias. For instance, I envision this in the following
>> way.
>>
>>
>> Aliases:
>> app1-read
>> app1-write
>>
>>
>> Initial creation:
>>
>> app1-read -> Index: App1-index1 (5 shards, 2 replicas)
>> app1-write -> Index: App1-index1
>>
>> User begins to have too much data in App1-index1.  The 5 shards are
>> causing hotspots. The following actions take place.
>>
>> 1) Create App1-index2 (5 shards, 2 replicas)
>>
>> 2) Update app1-read -> App1-index1 and App1-index2
>>
>> 3) Update app1-read -> App1-index1 and App1-index2
>>
>>
>> I have some uncertainty around

Re: Shard count and plugin questions

2014-06-05 Thread Todd Nine
der
> managing a bunch of indices. If an old index is too small, just start a new
> one which is bigger and has more shards and spans more nodes, and add them
> to the existing set of indices. With index alias you can combine many
> indices into one index name. This is very powerful.
>
> If you can not estimate the data growth rate, I recommend also to use a
> reasonable number of shards from the very start. Say, if you expect 50
> servers to run an ES node on, then simply start with 50 shards on a small
> number of servers, and add servers over time. You won't have to bother
> about shard count for a very long time if you choose such a strategy.
>
> Do not think about rivers, they are not built for such use cases. Rivers
> are designed as a "play tool" for fetching data quickly from external
> sources, for demo purpose. They are discouraged for serious production use,
> they are not very reliable if they run unattended.
>
> Jörg
>
>
> On Thu, Jun 5, 2014 at 7:33 AM, Todd Nine  wrote:
>
>>
>>
>>
>>> 2) https://github.com/jprante/elasticsearch-knapsack might do what you
>>> want.
>>>
>>
>> This won't quite work for us.  We can't have any down time, so it seems
>> like an A/B system is more appropriate.  What we're currently thinking is
>> the following.
>>
>> Each index has 2 aliases, a read and a write alias.
>>
>> 1) Both read and write aliases point to an initial index. Say shard count
>> 5 replication 2 (ES is not our canonical data source, so we're ok with
>> reconstructing search data)
>>
>> 2) We detect via monitoring we're going to outgrow an index. We create a
>> new index with more shards, and potentially a higher replication depending
>> on read load.  We then update the write alias to point to both the old and
>> new index.  All clients will then being dual writes to both indexes.
>>
>> 3) While we're writing to old and new, some process (maybe a river?) will
>> begin copying documents updated < the write alias time from the old index
>> to the new index.  Ideally, it would be nice if each replica could copy
>> only it's local documents into the new index.  We'll want to throttle this
>> as well.  Each node will need additional operational capacity
>> to accommodate the dual writes as well as accepting the write of the "old"
>> documents.  I'm concerned if we push this through too fast, we could cause
>> interruptions of service.
>>
>>
>> 4) Once the copy is completed, the read index is moved to the new index,
>> then the old index is removed from the system.
>>
>> Could such a process be implemented as a plugin?  If the work can happen
>> in parallel across all nodes containing a shard we can increase the
>> process's speed dramatically.  If we have a single worker, like a river, it
>> might possibly take too long.
>>
>>  --
> You received this message because you are subscribed to a topic in the
> Google Groups "elasticsearch" group.
> To unsubscribe from this topic, visit
> https://groups.google.com/d/topic/elasticsearch/4qO5BZSxWhc/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to
> elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/CAKdsXoGcxRGjZr%3DS9%3DSaOZYddWyNxjvFNJF41T_hHe7inoZOwQ%40mail.gmail.com
> <https://groups.google.com/d/msgid/elasticsearch/CAKdsXoGcxRGjZr%3DS9%3DSaOZYddWyNxjvFNJF41T_hHe7inoZOwQ%40mail.gmail.com?utm_medium=email&utm_source=footer>
> .
>
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CA%2Byzqf_wMHGZcK%2Bxozm%3Dxz4zh5prKmuob%3DBbXKzrHoVgiOk1wA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: Shard count and plugin questions

2014-06-05 Thread Todd Nine
Thanks for the feedback Mark.

I agree with your thoughts on the testing.  We plan on doing some testing,
finding our failure point, and dialing that back to some value that allows
us to still run the migration.  This way, we can get ahead of the problem.  Since
a re-index would actually introduce more load on the system, we want to
perform it before we start to get too much of a performance problem on any
given index.



I wanted to clarify my thoughts on the realtime read.


I definitely do not want to read the commit log, that would be horribly
inefficient since it's just a log.  Rather, I would insert an in memory
cache into the write path.  The in memory cache would be written to as well
as the commit log.  When a query is executed on a node, it would query
Lucene (the shard), as well as search the node's local cache for matching
documents.  These are then merged into a single result set via whatever
ordering parameter is set.  When the documents are flushed to Lucene, they
would be removed from the cache.
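
To make that concrete, here's a toy sketch of the idea in plain Java (this is
not anything ES provides today; DocMatcher and the callbacks are hypothetical):

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

interface DocMatcher { boolean matches( Map<String, Object> doc ); }  // predicate built from the query

class UnflushedCache {
    // documents live here from write time until the next flush makes them visible in Lucene
    private final ConcurrentMap<String, Map<String, Object>> unflushed =
            new ConcurrentHashMap<String, Map<String, Object>>();

    void onWrite( String id, Map<String, Object> doc ) {
        unflushed.put( id, doc );            // written alongside the commit-log entry
    }

    void onFlush( Collection<String> flushedIds ) {
        for ( String id : flushedIds ) {
            unflushed.remove( id );          // now queryable via Lucene, so drop it from the cache
        }
    }

    // merge the node-local unflushed docs into the hits Lucene returned for this shard
    List<Map<String, Object>> merge( DocMatcher matcher, List<Map<String, Object>> luceneHits ) {
        List<Map<String, Object>> merged = new ArrayList<Map<String, Object>>( luceneHits );
        for ( Map<String, Object> doc : unflushed.values() ) {
            if ( matcher.matches( doc ) ) {
                merged.add( doc );
            }
        }
        // ... re-sort merged by whatever ordering parameter the query specified ...
        return merged;
    }
}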

I'm sure I'm not the first ES user to conceive of this idea.  However, ES
doesn't have this functionality, and I'm assuming there's a reason for it.
 Is it simply that no one has tried this use case, or is it not a good idea
technically?  Would it introduce too much memory overhead, etc.?

Thanks,
Todd




On Wed, Jun 4, 2014 at 11:05 PM, Mark Walkom 
wrote:

> I haven't heard of a limit to the number of indexes, obviously the more
> you have the larger the cluster state that needs to be maintained.
>
> You might want to look into routing (
> http://exploringelasticsearch.com/advanced_techniques.html or
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-routing-field.html)
> as an alternative to optimise and minimise index count.
> You can also always hedge your bets and create an index with a larger
> number of shards, ie not a 1:1, shard:node relationship, and then move the
> excess shards to new nodes as they are added.
>
>  I'd be interested to see how you could measure how you'd outgrow an
> index though, technically it can just keep growing until the node can no
> longer deal with it. This is something that testing is good for, throw data
> at a single shard index and then when it falls over you have an indicator
> of how your hardware will handle things.
>
> As for reading the transaction log and searching it, you might be playing
> a losing game as your code to parse and search would have to be super quick
> to make worth doing.
>
> Regards,
> Mark Walkom
>
> Infrastructure Engineer
> Campaign Monitor
> email: ma...@campaignmonitor.com
> web: www.campaignmonitor.com
>
>
> On 5 June 2014 15:33, Todd Nine  wrote:
>
>> Thanks for the answers Mark.  See inline.
>>
>>
>> On Wed, Jun 4, 2014 at 3:51 PM, Mark Walkom 
>> wrote:
>>
>>> 1) The answer is - it depends. You want to setup a test system with
>>> indicative specs, and then throw some sample data at it until things start
>>> to break. However this may help
>>> https://www.found.no/foundation/sizing-elasticsearch/
>>>
>>
>> This is what I was expecting.  Thanks for the pointer to the
>> documentation.  We're going to have some pretty beefy clusters (SSDs Raid
>> 0, 8 to 16 cores and a lot of RAM) to power ES.  We're going to have a LOT
>> of indexes, we would be operating this as a core infrastructure service.
>>  Is there an upper limit on the amount of indexes a cluster can hold?
>>
>>
>>> 2) https://github.com/jprante/elasticsearch-knapsack might do what you
>>> want.
>>>
>>
>> This won't quite work for us.  We can't have any down time, so it seems
>> like an A/B system is more appropriate.  What we're currently thinking is
>> the following.
>>
>> Each index has 2 aliases, a read and a write alias.
>>
>> 1) Both read and write aliases point to an initial index. Say shard count
>> 5 replication 2 (ES is not our canonical data source, so we're ok with
>> reconstructing search data)
>>
>> 2) We detect via monitoring we're going to outgrow an index. We create a
>> new index with more shards, and potentially a higher replication depending
>> on read load.  We then update the write alias to point to both the old and
>> new index.  All clients will then being dual writes to both indexes.
>>
>> 3) While we're writing to old and new, some process (maybe a river?) will
>> begin copying documents updated < the write alias time from the old index
>> to the new index.  Ideally, it would be nice if each replica could copy
>> only it's local documents into the new index.  We

Re: Shard count and plugin questions

2014-06-04 Thread Todd Nine
Thanks for the answers Mark.  See inline.


On Wed, Jun 4, 2014 at 3:51 PM, Mark Walkom 
wrote:

> 1) The answer is - it depends. You want to setup a test system with
> indicative specs, and then throw some sample data at it until things start
> to break. However this may help
> https://www.found.no/foundation/sizing-elasticsearch/
>

This is what I was expecting.  Thanks for the pointer to the documentation.
 We're going to have some pretty beefy clusters (SSDs in RAID 0, 8 to 16
cores, and a lot of RAM) to power ES.  We're going to have a LOT of indexes,
as we would be operating this as a core infrastructure service.  Is there an
upper limit on the number of indexes a cluster can hold?


> 2) https://github.com/jprante/elasticsearch-knapsack might do what you
> want.
>

This won't quite work for us.  We can't have any down time, so it seems
like an A/B system is more appropriate.  What we're currently thinking is
the following.

Each index has 2 aliases, a read and a write alias.

1) Both read and write aliases point to an initial index. Say shard count 5
replication 2 (ES is not our canonical data source, so we're ok with
reconstructing search data)

2) We detect via monitoring we're going to outgrow an index. We create a
new index with more shards, and potentially a higher replication depending
on read load.  We then update the write alias to point to both the old and
new index.  All clients will then begin dual writes to both indexes.

3) While we're writing to old and new, some process (maybe a river?) will
begin copying documents updated < the write alias time from the old index
to the new index.  Ideally, it would be nice if each replica could copy
only its local documents into the new index.  We'll want to throttle this
as well.  Each node will need additional operational capacity
to accommodate the dual writes as well as accepting the write of the "old"
documents.  I'm concerned if we push this through too fast, we could cause
interruptions of service.


4) Once the copy is completed, the read alias is moved to the new index,
and the old index is removed from the system.
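
In Java admin-API terms, steps 2 and 4 would look roughly like this (a sketch;
the index and alias names are made up for illustration):

import org.elasticsearch.common.settings.ImmutableSettings;

// step 2: create the bigger index and add it to the write alias for the dual-write phase
client.admin().indices().prepareCreate( "app1-index2" )
    .setSettings( ImmutableSettings.settingsBuilder()
        .put( "index.number_of_shards", 10 )
        .put( "index.number_of_replicas", 2 )
        .build() )
    .execute().actionGet();

client.admin().indices().prepareAliases()
    .addAlias( "app1-index2", "app1-write" )
    .execute().actionGet();

// step 4: once the copy is done, cut the read alias over and drop the old index
client.admin().indices().prepareAliases()
    .addAlias( "app1-index2", "app1-read" )
    .removeAlias( "app1-index1", "app1-read" )
    .execute().actionGet();

client.admin().indices().prepareDelete( "app1-index1" ).execute().actionGet();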

Could such a process be implemented as a plugin?  If the work can happen in
parallel across all nodes containing a shard we can increase the process's
speed dramatically.  If we have a single worker, like a river, it might
possibly take too long.


> 3) How real time is real time? You can change index.refresh_interval to
> something small so that window of "unflushed" items is minimal, but that
> will have other impacts.
>

Once the index call returns to the caller, it would be immediately
available for query.  We've tried lowering the refresh rate; this results
in a pretty significant drop in throughput.  To meet our throughput
requirements, we're considering even turning it up to 5 or 15 seconds.  If
we can then search the data that's in our commit log (by storing it in
memory until flush), that would be ideal.
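
For reference, relaxing the interval on an existing index is just a settings
update, roughly (placeholder index name):

import org.elasticsearch.common.settings.ImmutableSettings;

client.admin().indices().prepareUpdateSettings( "app1-index1" )
    .setSettings( ImmutableSettings.settingsBuilder()
        .put( "index.refresh_interval", "15s" )   // trade a little freshness for indexing throughput
        .build() )
    .execute().actionGet();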

Thoughts?



> Regards,
> Mark Walkom
>
> Infrastructure Engineer
> Campaign Monitor
> email: ma...@campaignmonitor.com
> web: www.campaignmonitor.com
>
>
> On 5 June 2014 04:18, Todd Nine  wrote:
>
>> Hi All,
>>   We've been using elastic search as our search index for our new
>> persistence implementation.
>>
>> https://usergrid.incubator.apache.org/
>>
>> I have a few questions I could use a hand with.
>>
>> 1) Is there any good documentation on the upper limit to count of
>> documents, or total index size, before you need to allocate more shards?
>>  Do shards have a real world limit on size or number of entries to keep
>> response times low?  Every system has it's limits, and I'm trying to find
>> some actual data on the size limits.  I've been trolling Google for some
>> answers, but I haven't really found any good test results.
>>
>>
>> 2) Currently, it's not possible to increase the shard count for an index.
>> The workaround is to create a new index with a higher count, and move
>> documents from the old index into the new.  Could this be accomplished via
>> a plugin?
>>
>>
>> 3) We sometimes have "realtime" requirements, in that when an index call
>> returns, the document should be available.  Flushing explicitly is not a
>> good idea from a performance perspective.  Has anyone explored searching
>> the in-memory documents that have not yet been flushed and merging them
>> with the Lucene results?  Is this something that's feasible to implement
>> via a plugin?
>>
>> Thanks in advance!
>> Todd
>>

Re: cross data center replication

2014-06-04 Thread Todd Nine
Hey all,
 
Sorry to resurrect a dead thread.  Did you ever find a solution for 
eventual consistency of documents across EC2 regions?

Thanks,
todd



On Wednesday, May 1, 2013 5:50:00 AM UTC-7, Norberto Meijome wrote:
>
> +1 on all of the above. es-reindex already in my list of things to 
> investigate (for a number of issues...)
>
> cheers,
> b 
>
>
> On Wed, May 1, 2013 at 6:58 AM, Paul Hill 
> > wrote:
>
>> On 4/23/2013 8:44 AM, Daniel Maher wrote:
>>
>>> On 2013-04-23 5:22 PM, Saikat Kanjilal wrote:
>>>
 Hello Folks,
 [...] does ES out of the box currently support cross data
 center replication,  []

>>>
>>> Hello,
>>>
>>> I'd wager that the question you're really asking about is how to control 
>>> where shards are placed; if you can make deterministic statements about 
>>> where shards are, then you can create your own "rack-aware" or "data 
>>> centre-aware" scenarios.  ES has supported this "out of the box" for well 
>>> over a year now (possibly longer).
>>>
>>> You'll want to investigate "zones" and "routing allocation", which are 
>>> the key elements of shard placement.  There is an excellent blog post which 
>>> describes exactly how to set things up here :
>>> http://blog.sematext.com/2012/05/29/elasticsearch-shard-placement-control/
>>>
>> Is shard allocation really the correct solution if the data centers are 
>> globally distributed?
>>
>> If I have a data center in the US intended to serve data from the US,
>> but which should also have access to Europe and Asia data, and clusters
>> in both Europe and Asia with similar needs, would I really want to use
>> zones etc. and have one great global cluster with data-center-aware
>> configurations?
>>
>> Assuming that the US would be happy to deal with old documents from Asia
>> and Europe when Asia or Europe is offline or just not caught up, it would
>> seem that you would NOT want a "world" cluster, because I can't picture
>> how you'd configure a 3-part world cluster to both index into the right
>> indices and search the right (possible combination of) shards, while also
>> preventing "split brain".
>>
>> In the scenario I've described, I would think each data center might
>> better provide availability and eventual consistency (with less concern
>> for the remote data from the other regions) by having three clusters and
>> some type of syncing from one index to copies at the other two locations.
>> For example, the US datacenter might have a US, copyOfEurope, and
>> copyOfAsia index.
>>
>> Anyone have any observations about such a world-wide scenario?
>> Are there any index to index copy utilities?
>> Is there a river or other plugin that might be useful for this scenario
>> of three clusters working together?
>> How about the project https://github.com/karussell/elasticsearch-reindex?
>> Comments?
>>
>> -Paul
>>
>>
>>
>>
>>
>
>
> -- 
> Norberto 'Beto' Meijome
>  
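
For reference, the "zones" / allocation-awareness setup Daniel mentions above
boils down to tagging nodes and setting one cluster setting, roughly like
this (the "datacenter" attribute name is just an example):

# Sketch of shard-allocation awareness: each node carries an attribute (e.g.
# "node.datacenter: us-east" in elasticsearch.yml) and the cluster is told to
# spread shard copies across the values of that attribute.  The attribute
# name is an example only.
import requests

ES = "http://localhost:9200"

requests.put(ES + "/_cluster/settings", json={
    "persistent": {
        "cluster.routing.allocation.awareness.attributes": "datacenter"
    }
}).raise_for_status()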



Shard count and plugin questions

2014-06-04 Thread Todd Nine
Hi All,
  We've been using elastic search as our search index for our new 
persistence implementation.  

https://usergrid.incubator.apache.org/

I have a few questions I could use a hand with.

1) Is there any good documentation on the upper limit on document count, or 
total index size, before you need to allocate more shards?  Do shards have a 
real-world limit on size or number of entries to keep response times low?  
Every system has its limits, and I'm trying to find some actual data on the 
size limits.  I've been trolling Google for some answers, but I haven't 
really found any good test results.


2) Currently, it's not possible to increase the shard count for an existing 
index.  The workaround is to create a new index with a higher shard count and 
move documents from the old index into the new (a rough sketch follows 
question 3 below).  Could this be accomplished via a plugin?


3) We sometimes have "realtime" requirements, in that when an index call 
returns, the document should be available.  Flushing explicitly is not a good 
idea from a performance perspective.  Has anyone explored searching the 
in-memory documents that have not yet been flushed and merging them with the 
Lucene results?  Is this something that's feasible to implement via a plugin?
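
For what it's worth, the workaround in question 2 starts with creating the 
replacement index with a larger shard count up front, roughly like the 
sketch below (index name and the shard/replica counts are placeholders):

# Sketch of the question 2 workaround: create a replacement index with a
# higher shard count (the count of an existing index can't be changed), then
# copy documents across and flip any aliases.  Names and counts are
# placeholders only.
import requests

ES = "http://localhost:9200"

requests.put(ES + "/entities_v2", json={
    "settings": {
        "index": {
            "number_of_shards": 12,      # more than the original index
            "number_of_replicas": 2,
        }
    }
}).raise_for_status()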

Thanks in advance!
Todd
