Re: ScrollId doesn't advance with 2 indexes on a read alias 1.4.4
Correct, I'm always getting the first entity back with the scroll id. This doesn't happen if I remove index A from the alias; I then get the results I expect.

On Tue, Apr 14, 2015 at 12:05 PM, Roger de Cordova Farias <roger.far...@fontec.inf.br> wrote:

> Are you sure that calling the same scroll_id won't return the next results?
>
> AFAIK, the scroll_id can be the same and still return new records
>
> 2015-04-14 14:26 GMT-03:00 Todd Nine:
>
>> Hey guys,
>> I have 2 indexes. I have a read alias on both of the indexes (A and B),
>> and a write alias on 1 (B). I then insert 10 documents to the write alias,
>> which inserts them into index B. I perform the following query:
>>
>> {
>>   "from" : 0,
>>   "size" : 1,
>>   "post_filter" : {
>>     "bool" : {
>>       "must" : {
>>         "term" : {
>>           "edgeSearch" : "4cd2ba95-e2c9-11e4-bb39-c6c6eebe8d56_application__4cd2ba96-e2c9-11e4-bb39-c6c6eebe8d56_owner__users__SOURCE"
>>         }
>>       }
>>     }
>>   },
>>   "sort" : [ {
>>     "fields.double" : { "order" : "asc", "nested_filter" : { "term" : { "name" : "ordinal" } } }
>>   }, {
>>     "fields.long" : { "order" : "asc", "nested_filter" : { "term" : { "name" : "ordinal" } } }
>>   }, {
>>     "fields.string.exact" : { "order" : "asc", "nested_filter" : { "term" : { "name" : "ordinal" } } }
>>   }, {
>>     "fields.boolean" : { "order" : "asc", "nested_filter" : { "term" : { "name" : "ordinal" } } }
>>   } ]
>> }
>>
>> I receive my first record, and a scroll id, as expected.
>>
>> On my next request, I perform a request with the scroll id from the first response.
>>
>> What I expect: to receive my second record, and a new scroll id.
>>
>> What I get: the first record again, with the same scroll id.
>>
>> I'm on a 1.4.4 server, with a 1.4.4 node client running local integration testing.
>>
>> When I use the same logic on a read alias with a single index, I do not experience this problem, so I'm reasonably certain my client is coded correctly.
>>
>> Any ideas?
>>
>> Thanks,
>> Todd

-- You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CA%2Byzqf-T0uJKnB8fRBQ24Yda2Yp_yGUYCJ4jvsqo4yyAb6NxRA%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.
ScrollId doesn't advance with 2 indexes on a read alias 1.4.4
Hey guys,
I have 2 indexes. I have a read alias on both of the indexes (A and B), and a write alias on 1 (B). I then insert 10 documents to the write alias, which inserts them into index B. I perform the following query:

{
  "from" : 0,
  "size" : 1,
  "post_filter" : {
    "bool" : {
      "must" : {
        "term" : {
          "edgeSearch" : "4cd2ba95-e2c9-11e4-bb39-c6c6eebe8d56_application__4cd2ba96-e2c9-11e4-bb39-c6c6eebe8d56_owner__users__SOURCE"
        }
      }
    }
  },
  "sort" : [ {
    "fields.double" : { "order" : "asc", "nested_filter" : { "term" : { "name" : "ordinal" } } }
  }, {
    "fields.long" : { "order" : "asc", "nested_filter" : { "term" : { "name" : "ordinal" } } }
  }, {
    "fields.string.exact" : { "order" : "asc", "nested_filter" : { "term" : { "name" : "ordinal" } } }
  }, {
    "fields.boolean" : { "order" : "asc", "nested_filter" : { "term" : { "name" : "ordinal" } } }
  } ]
}

I receive my first record, and a scroll id, as expected.

On my next request, I perform a request with the scroll id from the first response.

What I expect: to receive my second record, and a new scroll id.

What I get: the first record again, with the same scroll id.

I'm on a 1.4.4 server, with a 1.4.4 node client running local integration testing. When I use the same logic on a read alias with a single index, I do not experience this problem, so I'm reasonably certain my client is coded correctly.

Any ideas?

Thanks,
Todd
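The contract the poster expects can be illustrated with a small stand-in for the client: each call with the current scroll id should return the next batch of hits, whether or not the id string itself changes between calls. A minimal Python sketch with a stubbed transport (FakeScrollClient and its method names are assumptions for illustration, not the real node client API):

```python
class FakeScrollClient:
    """Stand-in for an ES client: pages through hits the way scroll should."""
    def __init__(self, docs, size):
        self.docs, self.size, self.cursors = docs, size, {}

    def search_with_scroll(self):
        self.cursors["s1"] = self.size             # first page consumed
        return {"_scroll_id": "s1", "hits": self.docs[:self.size]}

    def scroll(self, scroll_id):
        pos = self.cursors[scroll_id]
        self.cursors[scroll_id] = pos + self.size  # advance; id may stay the same
        return {"_scroll_id": scroll_id, "hits": self.docs[pos:pos + self.size]}

def drain(client):
    """Pull every page; an empty page signals the scroll is exhausted."""
    page = client.search_with_scroll()
    hits = list(page["hits"])
    while True:
        page = client.scroll(page["_scroll_id"])
        if not page["hits"]:
            return hits
        hits.extend(page["hits"])

print(drain(FakeScrollClient(list(range(10)), size=1)))
```

The bug report amounts to: with two indexes behind the alias, the real server behaved as if the cursor was never advanced, so every `scroll` call returned page one again.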
Creating a no-op QueryBuilder and FilterBuilder
Hey all,
I have a bit of an odd question; hopefully someone can give me an answer. In Usergrid, we have our own existing query language. We parse this language into an AST, then visit each of the nodes and construct an ES query with the Java client. So far, very straightforward.

However, we have 1 wrench in the works. Some of our operations can only be implemented as filters. As a result, visiting some of our nodes results in a QueryBuilder operation, others in a FilterBuilder operation. To keep our context so that we still evaluate AND, OR, and NOT correctly, I'm pushing BoolQueryBuilder and BoolFilterBuilder onto our stacks as we traverse the tree. Within these bool terms, I will have both query and filter operations that represent a "no-op" operation.

Is there such a builder that already exists in the client, or will I need to create my own that doesn't append any content during the build phase?

Thanks,
Todd
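One common workaround (not from this thread, just a standard trick) is to treat match_all as the identity element inside a bool "must": AND-ing with a match-everything clause changes nothing, so it behaves as a no-op, and the builder can also drop such clauses entirely at build time. A sketch of the idea using plain query-DSL dicts rather than the Java builders (the build_bool helper is an assumption for illustration):

```python
MATCH_ALL = {"match_all": {}}   # identity element for a bool "must"

def build_bool(clauses):
    """AND clauses together, dropping no-ops so they add no content."""
    real = [c for c in clauses if c != MATCH_ALL]
    if not real:
        return MATCH_ALL        # everything was a no-op
    return {"bool": {"must": real}}

# A visited AST node that produced nothing contributes MATCH_ALL.
q = build_bool([MATCH_ALL, {"term": {"name": "ordinal"}}, MATCH_ALL])
print(q)
```

The same shape works on the filter side in 1.x (a match_all filter inside a bool filter), which keeps the query and filter stacks symmetric.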
Re: Set index.query.parse.allow_unmapped_fields = false for a single index
Figured this out (derp). So my next question is: how can I disable dynamic mappings so that they don't occur?

Thanks,
Todd

On Thursday, April 2, 2015 at 3:55:07 PM UTC-6, Todd Nine wrote:

> Hey guys,
> We're running ES as a core service, so I can't guarantee that my application will be the only one using the cluster. Is it possible to disallow unmapped fields explicitly for an index without modifying the global setting?
>
> Thanks,
> Todd
Set index.query.parse.allow_unmapped_fields = false for a single index
Hey guys,
We're running ES as a core service, so I can't guarantee that my application will be the only one using the cluster. Is it possible to disallow unmapped fields explicitly for an index without modifying the global setting?

Thanks,
Todd
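For the follow-up question about disabling dynamic mappings: per-type, the mapping's "dynamic" setting can be set to "strict" (reject documents with unmapped fields) or false (accept them but don't index the unknown fields). A sketch of such a mapping body as a Python dict (the index/type/field names are placeholders, not from the thread):

```python
import json

# "strict" makes indexing fail on unmapped fields; False silently ignores them.
mapping = {
    "mytype": {
        "dynamic": "strict",
        "properties": {
            "name": {"type": "string", "index": "not_analyzed"},
        },
    }
}

# This is the body you would PUT to /myindex/_mapping/mytype on a 1.x cluster.
print(json.dumps(mapping, indent=2))
```

Because it lives in the mapping rather than in elasticsearch.yml, this applies only to the one index/type and leaves the rest of the shared cluster alone.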
Using scroll and different results sizes
Hey all,
We use ES as our indexing subsystem. Our canonical record store is Cassandra. Due to the denormalization we perform for faster query speed, the state of documents in ES can occasionally lag behind the state of our Cassandra instance. To accommodate this eventually consistent system, our query path is the following:

Query ES -> returns scrollId and document ids -> Load entities from Cassandra -> Drop "stale" documents from results and asynchronously remove them.

If the user requested 10 entities and only 8 were current, we will be missing 2 results and need to make another trip to ES to satisfy the result size. Is it possible to do the following?

//get first page with 10 results
POST /testindex/mytype/_search?scroll=1m
{"from":0, "size": 10}

//get the 2 missing results from the same scroll
POST /testindex/mytype/_search?scroll_id=
{"size":2}

I always seem to get back 10 on my second request. If no other option is available, I'd rather truncate our result set size to < the user's requested size than take the performance hit of using from and size to drop results. Our document counts can be 50 million+, so I'm concerned with the performance implications of not using scroll.

Thoughts?

Thanks,
Todd
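The re-query loop described above (fetch a page of ids, drop the stale ones, top up from the index until the requested size is met or the index is exhausted) can be sketched with stubbed clients. fetch_page and load_entities below are assumptions standing in for the real ES and Cassandra calls:

```python
def fill_page(fetch_page, load_entities, size):
    """Keep pulling candidate ids until `size` live entities are found."""
    results, offset = [], 0
    while len(results) < size:
        candidate_ids = fetch_page(offset, size - len(results))
        if not candidate_ids:
            break                       # index exhausted; return what we have
        offset += len(candidate_ids)
        entities = load_entities(candidate_ids)
        results.extend(e for e in entities if e is not None)  # drop stale docs
    return results

# Stubs: ids 0-9 exist in the index, but 2 and 5 are stale (gone in Cassandra).
index = list(range(10))
stale = {2, 5}
fetch = lambda offset, n: index[offset:offset + n]
load = lambda ids: [None if i in stale else i for i in ids]

print(fill_page(fetch, load, 8))
```

Note the top-up fetches ask only for the shortfall (size minus results so far), which is exactly the variable-page-size behavior the post asks whether scroll supports.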
Very slow index speeds with dynamic mapping and large volume of documents with new fields
Hey all,
We're bumping up against a production problem I could use a hand with: steadily decreasing index speeds. We have 12 c3.4xl data nodes and 1 c3.8xl master node (with 2 smaller backups). We're indexing 45 million documents into a single index, with a single shard and no replicas. As our number of documents grows, our indexing speed slows to a crawl. We've applied all the standard mlockall, ulimit, and SSD merge-throttling settings, so I feel our cluster is tuned pretty well.

When I inspected the data, I noticed our user is adding a new field on every document. When I view the pending tasks on our master, the task queue always holds 300+ tasks attempting to perform dynamic mapping. I've also checked segment merging: we never have more than 1 merge going on, and even then it lasts a second or two, not long at all.

This brings me to my question: when dynamic mapping is performed, is this done on the master only? Obviously that would introduce a bottleneck and explain our sudden performance drop. I'm at a loss to explain this issue. Any advice would be appreciated.

Thanks,
Todd
Re: Unexpected shard allocation with 0 replica indexes
Thanks Mark,
I've looked at these, as well as these configuration parameters: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-update-settings.html#_balanced_shards

However, I've not been able to find a configuration that actually accomplishes what we're looking for. We're using the AWS plugin as well, which doesn't seem to help with our shard allocation. It does, however, ensure we have replicas in other zones, which is obviously helpful for redundancy.

Are there any known settings, either with allocation awareness or cluster settings, that will accomplish the behavior I'm looking for? Otherwise, we'll have to write a client API to move the shards after index allocation to ensure we get an even load.

Thanks,
Todd

On Sunday, February 22, 2015 at 3:25:28 PM UTC-7, Mark Walkom wrote:

> It's not a bug. ES allocates based on higher-level primary shard count and doesn't take into account what index a shard may belong to; this is where allocation awareness and routing come into play.
>
> Take a look at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-cluster.html#allocation-awareness
>
> On 23 February 2015 at 06:46, Todd Nine wrote:
>
>> Hi All,
>> We have several indexes in our ES cluster. ES is not our canonical system of record; we use it only for searching.
>>
>> Some of our applications have very high write throughput, so for these we allocate a single primary shard for each of our nodes. For example, we have 6 nodes, and we create our index with 6 shards and 0 replicas. I would expect ES to balance the index's primary shards across nodes. However, it will load as many as 3 primaries on a single node, and some nodes will have no shards. While each node does appear to have the same total number of primaries, not all our indexes have equal write throughput; therefore, I want the primary (and only) shards evenly distributed per index.
>>
>> During heavy writes, we experience slowness because our shards are not evenly distributed across our nodes. Currently we're getting around this by manually moving shards, but this seems to be something ES should do automatically. We're running version 1.4.3 with default allocation parameters. Is this a bug?
>>
>> Thanks,
>> Todd
Unexpected shard allocation with 0 replica indexes
Hi All,
We have several indexes in our ES cluster. ES is not our canonical system of record; we use it only for searching.

Some of our applications have very high write throughput, so for these we allocate a single primary shard for each of our nodes. For example, we have 6 nodes, and we create our index with 6 shards and 0 replicas. I would expect ES to balance the index's primary shards across nodes. However, it will load as many as 3 primaries on a single node, and some nodes will have no shards. While each node does appear to have the same total number of primaries, not all our indexes have equal write throughput; therefore, I want the primary (and only) shards evenly distributed per index.

During heavy writes, we experience slowness because our shards are not evenly distributed across our nodes. Currently we're getting around this by manually moving shards, but this seems to be something ES should do automatically. We're running version 1.4.3 with default allocation parameters. Is this a bug?

Thanks,
Todd
Re: One shard continually fails to allocate
Hey Aaron,
What do you get back if you use these commands to manually allocate the shard to a node? http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-reroute.html

I had this problem before, but it turned out we had 1 node that had accidentally been upgraded while the rest were still on a previous version. I was able to determine this by reading the error output from the shard allocation command.

Todd

On Tuesday, February 17, 2015 at 11:48:48 PM UTC-8, David Pilato wrote:

> What gives http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cat-shards.html#cat-shards ?
>
> -- David ;-)
> Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
>
> On 18 Feb 2015 at 06:44, Aaron C. de Bruyn wrote:
>
> All the servers have nearly 1 TB free space.
>
> -A
>
> On Tue, Feb 17, 2015 at 7:44 PM, David Pilato wrote:
>
>> It's a replica? Might be because you are running low on disk space?
>>
>> On 18 Feb 2015 at 01:16, Aaron de Bruyn wrote:
>>
>> I have one shard that continually fails to allocate. There is nothing in the logs that would seem to indicate a problem on any of the servers. The pattern of one of the copies of shard '2' not being allocated runs throughout all my logstash indexes. Running 1.4.3 on all nodes. Any pointers on what I should check?
>>
>> Thanks,
>> -A
Uneven primary shard distribution with increasing index allocation
Hey All,
We're on ES 1.4.3. Initially, I thought I had this issue, however it appears I was incorrect: https://github.com/elasticsearch/elasticsearch/issues/9023

We're seeing very uneven primary shard distribution as we add new indexes to our system. We're running a 6-node cluster and adding indexes with 12 shards and 1 replica. What I'm noticing is that 1/4 (3) of the primary shards are being allocated to a single node. In our application, this is causing severe performance problems, since we're writing 1/3 of our throughput to a single node, leaving the remaining 2/3 spread over 5 nodes.

We currently have 87 indexes and growing. I never noticed this with a handful of indexes, but as we add more it's becoming more prevalent. What can I do to diagnose and fix this issue? I've tried moving the primary shards, but this seems to have little effect. When catting the shards and checking how many primaries exist per node, they're very skewed:

curl -X GET res001sy:9200/_cat/shards | grep "p STARTED" | grep res001sy | wc -l => 259
curl -X GET res001sy:9200/_cat/shards | grep "p STARTED" | grep res002sy | wc -l => 228
curl -X GET res001sy:9200/_cat/shards | grep "p STARTED" | grep res003sy | wc -l => 191
curl -X GET res001sy:9200/_cat/shards | grep "p STARTED" | grep res004sy | wc -l => 106
curl -X GET res001sy:9200/_cat/shards | grep "p STARTED" | grep res005sy | wc -l => 37
curl -X GET res001sy:9200/_cat/shards | grep "p STARTED" | grep res006sy | wc -l => 13

As you can see, 01 is our highest with 259 primary shards, while 06 has only 13. I've tried forcing shards to rebalance with the command below, but it seems to have no effect. Also, note that we're running the AWS cloud plugin (https://github.com/elasticsearch/elasticsearch-cloud-aws).

curl -XPOST 'res002sy:9200/_cluster/reroute' -d '{}'

Any ideas what I can do to resolve this problem? The uneven distribution seems to be killing our more heavily loaded nodes, since they're taking the majority of the primary shard throughput. I'd prefer that nodes have an approximately equal number of primaries and replicas, since our indexes are a mix of write-heavy and read-heavy; we seem to have very few that are 50/50.

Thanks,
Todd
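The six grep/wc pipelines above can be collapsed into one pass over the _cat/shards output, which also makes the skew easy to compare at a glance. A sketch over made-up sample lines (the node names and column layout follow the 1.x default _cat/shards format, but the data is invented):

```python
from collections import Counter

# Sample `_cat/shards` lines: index shard prirep state docs store ip node
cat_output = """\
idx-a 0 p STARTED 100 1mb 10.0.0.1 res001sy
idx-a 1 r STARTED 100 1mb 10.0.0.2 res002sy
idx-b 0 p STARTED 200 2mb 10.0.0.1 res001sy
idx-b 1 p STARTED 200 2mb 10.0.0.3 res003sy
"""

def primaries_per_node(text):
    """Count STARTED primary shards per node in one pass."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 8 and fields[2] == "p" and fields[3] == "STARTED":
            counts[fields[7]] += 1      # node name is the last column
    return counts

print(primaries_per_node(cat_output))
```

Unlike grepping for the node name anywhere in the line, this checks the prirep and node columns explicitly, so a node name appearing in an index name cannot inflate the count.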
Topology of tribe nodes in a distributed replicated environment and automatic index + alias creation.
Hey guys,
We have a slightly different use case than I'm able to find examples for with the tribe node. Any feedback would be appreciated.

What we have now (single region):
- We create indexes in our code automatically. They're based on timeuuids, so we never have to worry about them conflicting.
- We also create read and write aliases to these indexes.
- We read and write documents to our local region only.

What we want: a 2+ region topology (us-west, us-east, Asia Pacific).

What I'm envisioning:

Our code: We implement a tool that will create a log of all index creations, as well as read and write alias creations. These operations are then replayed individually on each region's cluster. If a region is offline or unreachable, it won't block other regions getting created.

Tribe nodes:
- Write: write to all regions. Is there a way we can do this so that the call returns after our local region has accepted the write? We don't want to wait for all regions to ack the write. Again, if a region is offline, we'll lose that ability; we can re-index those documents later.
- Read: always read from the local cluster for efficiency and responsiveness.

We're exploring all options for multi-region deployments. Ultimately we'd like reads to occur locally for performance, and are OK with queueing background writes to each region.

Thoughts?

Thanks,
Todd
Re: Help creating a near real time streaming plugin to perform replication between clusters
Thanks for the suggestion on the tribe nodes. I'll take a look at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService in more depth. A reference implementation would be helpful in understanding its usage; do you happen to know of any projects that use it?

From an architecture perspective, I'm concerned with having the cluster master initiate any replication operations aside from replaying index modifications. As we continue to increase our cluster size, I'm worried it may become too much load on the master to keep up. Our system is getting larger every day; we have 12 c3.4xl instances in each region currently. Our client of ES is a multi-tenant system (http://usergrid.incubator.apache.org/), so each application created in the system gets its own indexes in ES. This allows us to scale the indexes using read/write aliases per each application's usage.

To take a step back even further, is there a way we can use something existing in ES to perform this work, possibly with routing rules etc.? My primary concern is that we don't have to query across regions, can recover from a region or network outage, and that replication can begin once communication between regions is restored. I seem to be venturing into uncharted territory here by thinking I need to create a plugin, and I doubt I'm the first user to encounter such a problem. If there are any other known solutions, that would be great. I just need our replication time to be every N seconds.

Thanks again!
Todd

On Friday, January 23, 2015 at 2:43:08 PM UTC-7, Jörg Prante wrote:

> This looks promising.
>
> For admin operations, see also the tribe node. A special "replication-aware tribe node" (or maybe more than one tribe node for resiliency) could supervise the cluster-to-cluster replication.
>
> For the segment strategy, I think it is hard to go down to the level of the index store, capture the files properly, and put them over the wire to a target. It should be better to replicate on the shard level, maybe by reusing some of the code of org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService, so that a tribe node can trigger a snapshot action on the source cluster master, open a transactional connection from a node in the source cluster to a node in the target cluster, and place a restore action on a queue on the target cluster master, plus rollback logic if the shard transaction fails. So in short, the ES cluster-to-cluster replication process could be realized by a "primary shard replication protocol".
>
> Just my 2¢
>
> Jörg
>
> On Fri, Jan 23, 2015 at 7:42 PM, Todd Nine wrote:
>
>> Thanks for the pointers Jörg,
>> We use RxJava in our current application, so I'm familiar with backpressure and ensuring we don't overwhelm target systems. I've been mulling over the high-level design a bit more. A common approach in all systems that perform multi-region replication is the concept of "log shipping". It's used heavily in SQL systems for replication, as well as in systems such as Megastore/HBase. This seems like it would be the most efficient way to ship data from Region A to Region B with a reasonable amount of latency. I was thinking something like the following.
>>
>> *Admin Operation Replication*
>>
>> This can get messy quickly. I'm thinking I won't have any sort of "merge" logic, since this can be very different for everyone's use case. I was going to support broadcasting the following operations:
>>
>> - Index creation
>> - Index deletion
>> - Index mapping updates
>> - Alias index addition
>> - Alias index removal
>>
>> This can also get tricky because it makes the assumption of unique index operations in each region. Our indexes are time-UUID based, so I know we won't get conflicts. I won't handle the case of a replayed operation conflicting with an existing index; I'll simply log it and drop it. Handlers could be built in later so users could create their own resolution logic. Also, this must be replayed in a very strict order. I'm concerned that adding this additional master-to-master region communication could result in more load on the master. This can be solved by running a dedicated master, but I don't really see any other solution.
>>
>> *Data Replication*
>>
>> 1) Store last sent segments, probably in a system index. Each region could be offline at different times,
Re: Help creating a near real time streaming plugin to perform replication between clusters
Thanks for the pointers Jorg, We use Rx Java in our current application, so I'm familiar with backpressure and ensuring we don't overwhelm target systems. I've been mulling over the high level design a bit more. A common approach in all systems that perform multi region replication is the concept of "log shipping". It's used heavily in SQL systems for replication, as well as in systems such as Megastore/HBase. This seems like it would be the most efficient way to ship data from Region A to Region B with a reasonable amount of latency. I was thinking something like the following. *Admin Operation Replication* This can get messy quickly. I'm thinking I won't have any sort of "merge" logic since this can get very different for everyone's use case. I was going to support broadcasting the following operations. - Index creation - Index deletion - Index mapping updates - Alias index addition - Alias index removal This can also get tricky because it makes the assumption of unique index operations in each region. Our indexes are Time UUID based, so I know we won't get conflicts. I won't handle the case of an operation being replayed that conflicts with an existing index, I'll simply log it and drop it. Handlers could be built in later so users could create their own resolution logic. Also, this must be replayed in a very strict order. I'm concerned that adding this additional master/master region communication could result in more load on the master. This can be solved by running a dedicated master, but I don't really see any other solution. *Data Replication* 1) Store last sent segments, probably in a system index. Each region could be offline at different times, so for each segment I'll need to know where it's been sent. 2) Monitor segments as they're created. I still need to figure this out a bit more in the context of latent sending. Example. Region us-east-1 ES nodes. We missed sending 5 segments to us-west-1 , and they were merged into 1. 
I now only need to send the 1 merged segment to us-west-1, since the other 5 segments will be removed. However, then a merged segment is created in us-east-1 from 5 segments I've already sent to us-west-1, I won't want to ship that since it will already contain the data. As the tree is continually merged, I'll need to somehow sort out what contains shipped data, and what contains unshipped data. 3) As a new segment is created perform the following. 3.a) Replay any administrative operations since the last sync on the index to the target region, so the state is current. 3.b) Push the segment to the target region 4) The region receives the segment, and adds it to it's current segments. When a segment merge happens in the receiving region, this will get merged in. Thoughts? On Thursday, January 15, 2015 at 5:29:10 PM UTC-7, Jörg Prante wrote: > > While it seems quite easy to attach listeners to an ES node to capture > operations in translog-style and push out index/delete operations on shard > level somehow, there will be more to consider for a reliable solution. > > The Couchbase developers have added a data replication protocol to their > product which is meant for transporting changes over long distances with > latency for in-memory processing. > > To learn about the most important features, see > > https://github.com/couchbaselabs/dcp-documentation > > and > > http://docs.couchbase.com/admin/admin/Concepts/dcp.html > > I think bringing such a concept of an inter cluster protocol into ES could > be a good starting point, to sketch the complete path for such an ambitious > project beforehand. > > Most challenging could be dealing with back pressure when receiving > nodes/clusters are becoming slow. For a solution to this, reactive Java / > reactive streams look like a viable possibility. 
> > See also > > https://github.com/ReactiveX/RxJava/wiki/Backpressure > > http://www.ratpack.io/manual/current/streams.html > > I'm in favor of Ratpack since it comes with Java 8, Groovy, Google Guava, > and Netty, which has a resemblance to ES. > > In ES, for inter cluster communication, there is not much coded afaik, > except snapshot/restore. Maybe snapshot/restore can provide everything you > want, with incremental mode. Lucene will offer numbered segment files for > faster incremental snapshot/restore. > > Just my 2¢ > > Jörg > > > > On Thu, Jan 15, 2015 at 7:00 PM, Todd Nine > > wrote: > >> Hey all, >> I would like to create a plugin, and I need a hand. Below are the >> requirements I have. >> >> >>- Our documents are immutable. They are only ever created or >>d
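The segment bookkeeping in steps 1 and 2 of the data-replication design above could be sketched as follows. This is a hypothetical in-memory class (all names invented for illustration); a real plugin would persist this state in a system index, as step 1 suggests. The key rule: a merged segment inherits "already shipped" status for a region only if every one of its source segments was already shipped there.

```java
import java.util.*;

// Hypothetical sketch: tracks, per target region, which segments have been
// shipped, and decides whether a newly merged segment still needs shipping.
public class SegmentShipTracker {
    // region name -> set of segment names already shipped to that region
    private final Map<String, Set<String>> shipped = new HashMap<>();

    public void markShipped(String region, String segment) {
        shipped.computeIfAbsent(region, r -> new HashSet<>()).add(segment);
    }

    // When segments are merged, the merged segment counts as shipped to a
    // region only if every source segment was already shipped there.
    public void onMerge(String merged, Collection<String> sources) {
        for (Set<String> regionShipped : shipped.values()) {
            if (regionShipped.containsAll(sources)) {
                regionShipped.add(merged);
            }
            regionShipped.removeAll(sources); // sources are deleted by the merge
        }
    }

    public boolean needsShipping(String region, String segment) {
        return !shipped.getOrDefault(region, Collections.<String>emptySet())
                       .contains(segment);
    }
}
```

This captures the us-east-1/us-west-1 example: if 5 unshipped segments merge, only the merged segment needs shipping; if 5 already-shipped segments merge, the merged segment is skipped.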
Help creating a near real time streaming plugin to perform replication between clusters
Hey all, I would like to create a plugin, and I need a hand. Below are the requirements I have.

- Our documents are immutable. They are only ever created or deleted; updates do not apply.
- We want mirrors of our ES cluster in multiple AWS regions. This way, if the WAN between regions is severed for any reason, we do not suffer an outage, just a delay in consistency.
- As documents are added or removed, they are rolled up and then shipped in batch to the other AWS regions. This can be as fast as a few milliseconds, or as slow as minutes, and will be user configurable. Note that a full backup+load is too slow; this is more of a near-realtime operation.
- This will sync the following operations.
  - Index creation/deletion
  - Alias creation/deletion
  - Document creation/deletion

What I'm thinking architecturally.

- The plugin is installed on each node in our cluster in all regions.
- The plugin will only gather changes for the primary shards on the local node.
- After the timeout elapses, the plugin will ship the changelog to the other AWS regions, where the plugin will receive it and process it.

Are there any APIs I can look at that are a good starting point for developing this? I'd like to do a simple prototype with two 1-node clusters reasonably soon. I found several plugin tutorials, but I'm more concerned with what part of the ES API I can call to receive events, if any. Thanks, Todd -- You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/dff53da5-8a0c-4805-8f97-72844019a79e%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
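The "gather changes on each node, then ship the changelog after a timeout" flow described above can be sketched with a plain batching buffer. This is an illustrative standalone class (names are invented, and it deliberately omits the actual event hooks and network transport):

```java
import java.util.*;

// Hypothetical sketch of the per-node changelog described above: primary-shard
// changes accumulate in a buffer and are shipped as one batch once the
// user-configured interval elapses.
public class ChangelogBatcher {
    private final long flushIntervalMillis;
    private long lastFlush;
    private final List<String> pending = new ArrayList<>();

    public ChangelogBatcher(long flushIntervalMillis, long nowMillis) {
        this.flushIntervalMillis = flushIntervalMillis;
        this.lastFlush = nowMillis;
    }

    public void record(String operation) {
        pending.add(operation); // e.g. "CREATE doc-123" or "DELETE doc-456"
    }

    // Returns the batch to ship if the interval has elapsed, else an empty list.
    public List<String> drainIfDue(long nowMillis) {
        if (nowMillis - lastFlush < flushIntervalMillis || pending.isEmpty()) {
            return Collections.emptyList();
        }
        List<String> batch = new ArrayList<>(pending);
        pending.clear();
        lastFlush = nowMillis;
        return batch;
    }
}
```

Because the interval is a constructor argument, it can range from milliseconds to minutes, matching the configurability requirement above.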
Re: Multi region replication help
Hey Mark, We're looking for a solution that's faster than a snapshot+restore. We do follow an eventual consistency model, but we'd like the replication to be near realtime, so a snapshot and restore would be too slow. We really need something that synchronizes between regions with some sort of log or queue, so that if a region goes down, we can replay all missed operations. We were going to put an SQS queue in our application tier, but if this is something an ES plugin can provide, we would prefer to utilize that over writing our own. I was searching the plugin API, but I can't seem to find any hook that receives post-write/post-delete document events. Ideally, I would like each node to queue up changes to primary shards. This way, we can distribute sending the changes among all our nodes and we won't have a single point of failure. Thanks, Todd

On Wed, Jan 14, 2015 at 3:03 PM, Mark Walkom wrote:
> You could use snapshot and restore, or even Logstash.
>
> On 15 January 2015 at 10:07, Todd Nine wrote:
>> Hi all, We have a deployment scenario I can't seem to find any examples of, and any help would be greatly appreciated. We're running ElasticSearch in 3 AWS regions. We want these regions to survive a failure from other regions, and we want all writes and reads from our clients to occur in the local region. Rather than have 1 large cluster that spans 3 regions, I would like the ability to asynchronously replicate document creation and deletion across the regions. Our application creates immutable documents, so update semantics don't apply. Is there any way this can be accomplished currently? Note that I don't think a river will provide us with the partition we need since it is a singleton.
>>
>> Thanks!
>> Todd
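The "log or queue we can replay" idea above could look like the following. This is a hypothetical in-memory sketch (in practice SQS or another durable store would back it): an append-only operation log with an independent cursor per target region, so a region that was offline simply resumes from its cursor when it comes back.

```java
import java.util.*;

// Hypothetical sketch of a replayable operation log with one cursor per region.
public class RegionReplayLog {
    private final List<String> log = new ArrayList<>();
    private final Map<String, Integer> cursors = new HashMap<>();

    public void append(String operation) {
        log.add(operation);
    }

    // Returns all operations the region has not yet seen and advances its cursor.
    public List<String> replayFor(String region) {
        int from = cursors.getOrDefault(region, 0);
        List<String> missed = new ArrayList<>(log.subList(from, log.size()));
        cursors.put(region, log.size());
        return missed;
    }
}
```

Each region tracking its own cursor is what avoids the singleton problem: any node (or region) can catch up independently, with no single replication coordinator.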
Multi region replication help
Hi all, We have a deployment scenario I can't seem to find any examples of, and any help would be greatly appreciated. We're running ElasticSearch in 3 AWS regions. We want these regions to survive a failure from other regions, and we want all writes and reads from our clients to occur in the local region. Rather than have 1 large cluster that spans 3 regions, I would like the ability to asynchronously replicate document creation and deletion across the regions. Our application creates immutable documents, so update semantics don't apply. Is there any way this can be accomplished currently? Note that I don't think a river will provide us with the partition we need since it is a singleton. Thanks! Todd
Internal implementation details when using geo hash
Hey All, I have a question about the internal implementation of geo hashes and distance filters. Here is my current understanding; I'm struggling to figure out how to apply these to our queries internally in ES. Bool queries are very efficient: internally they perform bitmap union, intersection, and subtraction for very fast candidate aggregation per term. Geo distance filters are then run on the results of the candidates from the bitmap logic. Each document must be evaluated individually in memory. Obviously, for large document sets from the bitmap evaluation, this is inefficient. What happens when someone only gives our application a geo distance query? To make this more efficient, I would like to use geo hashing. ES seems to have geo hashing built in, but it's documented as a filter. For instance, I envision the following workflow internally in ES.

1) User searches for all matches within 2km of their current location.
2) Use a geohash to create a hash that will encapsulate all points within 2km of their current location.
3) Use the bool query with this geohash to narrow the candidate result set.
4) Apply the distance filter to these candidates to get more accurate results.

However, when reading the documentation on geohash searching, it's still a filter. Internally, does it use geohashing and the fast bitmaps (since it's a string match) and then filter, or is it all filters, with the hash evaluated in memory for all documents?

http://www.elasticsearch.org/guide/en/elasticsearch/reference/1.4/query-dsl-geohash-cell-filter.html

Thanks, Todd
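The four-step workflow above maps onto the 1.4 query DSL roughly as follows. The field name `location` and the coordinates are made up for illustration; the `geohash_cell` filter does a term-level cell match (per the linked documentation, the mapping needs geohash prefix indexing enabled for it), and the `geo_distance` filter then refines those candidates:

```json
{
  "query" : {
    "filtered" : {
      "query" : { "match_all" : {} },
      "filter" : {
        "bool" : {
          "must" : [
            { "geohash_cell" : {
                "location" : { "lat" : 37.7742, "lon" : -122.4047 },
                "precision" : "2km",
                "neighbors" : true
            } },
            { "geo_distance" : {
                "distance" : "2km",
                "location" : { "lat" : 37.7742, "lon" : -122.4047 }
            } }
          ]
        }
      }
    }
  }
}
```

Setting `neighbors: true` matters here: a 2km radius can straddle a cell boundary, so the adjacent cells must be included before the distance filter trims the result.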
Refresh on 1.4.0 not working as expected
Hey guys, We're testing ES 1.4.0, coming from 1.3.2. I'm noticing some strange behavior in our clients in our integration tests. They perform the following logic.

1) Create the first index in the cluster (single node) with a custom __default__ dynamic mapping.
2) Add 3 documents, each of a new type, to the index: egg, muffin, oj.
3) Perform a refresh on the new index.
4) Perform a match-all query with the types egg, muffin, and oj. I receive no responses.
5) Perform the same query again. I receive responses, and I receive all documents consistently after refresh.

This only seems to be an issue when the following is true. On subsequent tests, I don't see this behavior.

1) It's the first time refresh is called on the first index created in the system.
2) The index uses default type mappings.

If I roll back to 1.3.2, I don't see this. I haven't tried 1.3.5. Has something changed internally on index refresh? Note that we use the refresh operation sparingly, mostly in rare admin functions of our system or in integration testing. We rarely use this in production. Thanks, Todd
Re: NPE from server when using query + geo filter + sort
You're right Jorg. There was an issue where types was incorrectly set to null because more than 1 was specified. As a result, it passed the check for at least 1 element in the array, even though the type itself in element 0 was null. Thank you for your help! On Tuesday, November 11, 2014 12:48:22 PM UTC-7, Jörg Prante wrote: > > Looks like a bug in org.apache.usergrid.persistence.index.impl. > EsEntityIndexImpl > > Check if the "types" are set to a non-null value in the SearchRequest. If > you force them to be a null value, SearchRequest will throw the NPE you > posted. > > Jörg > > On Tue, Nov 11, 2014 at 6:11 PM, Todd Nine > > wrote: > >> Hi all, >> I'm getting some strange behavior from the ES server when using a term >> query + a geo distance filter + a sort. I've tried this with 1.3.2, >> 1.3.5, as well as 1.4.0. All exhibit this same behavior. I'm using the >> Java transport client. Here is my SearchRequestBuilder payload in >> toString() format. >> >> >> query { "from" : 0, "size" : 10, "query" : { "term" : { >> "ug_context" : >> "2abaf75a-69c4-11e4-918e-4658cde16dad__application__zzzcollzzzusers" } >> }, "post_filter" : { "geo_distance" : { "go_location" : [ >> -122.40474700927734, 37.77427673339844 ], "distance" : "200.0m" } >> }, "sort" : [ { "su_created" : { "order" : "asc", >> "ignore_unmapped" : true } }, { "nu_created" : { "order" : >> "asc", "ignore_unmapped" : true } }, { "bu_created" : { >> "order" : "asc", "ignore_unmapped" : true } } ] } limit 10 >> >> >> Here is the stack trace >> >> org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: >> Failed execution >> at >> org.elasticsearch.action.support.AdapterActionFuture.rethrowExecutionException(AdapterActionFuture.java:90) >> at >> org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:50) >> at >> org.apache.usergrid.persistence.index.impl.EsEntityIndexImpl.search(EsEntityIndexImpl.java:307) >> at >> 
org.apache.usergrid.corepersistence.CpRelationManager.searchConnectedEntities(CpRelationManager.java:1453) >> at >> org.apache.usergrid.corepersistence.CpEntityManager.searchConnectedEntities(CpEntityManager.java:1585) >> at org.apache.usergrid.persistence.GeoIT.testGeo(GeoIT.java:202) >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >> at >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) >> at >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >> at java.lang.reflect.Method.invoke(Method.java:606) >> at >> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) >> at >> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) >> at >> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) >> at >> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) >> at >> org.apache.usergrid.CoreApplication$1.evaluate(CoreApplication.java:133) >> at org.junit.rules.RunRules.evaluate(RunRules.java:20) >> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) >> at >> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) >> at >> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) >> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) >> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) >> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) >> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) >> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) >> at org.apache.usergrid.CoreITSetupImpl$1.evaluate(CoreITSetupImpl.java:66) >> at org.junit.rules.RunRules.evaluate(RunRules.java:20) >> at org.junit.runners.ParentRunner.run(ParentRunner.java:309) >> at org.junit.runner.JUnitCore.run(JUnitCore.java:160) >> at >> 
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74) >> at >> com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(
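The root cause described above — a types array that is non-empty but carries a null element, which then blows up in `StreamOutput.writeString` — suggests a defensive check before building the `SearchRequest`. A minimal sketch in plain Java (the helper name is invented):

```java
// Hypothetical guard: a non-empty array can still carry null elements, which
// is exactly what produced the NPE in StreamOutput.writeString above.
public class TypesValidator {
    public static String[] requireValidTypes(String[] types) {
        if (types == null || types.length == 0) {
            throw new IllegalArgumentException("at least one type is required");
        }
        for (String type : types) {
            if (type == null) {
                throw new IllegalArgumentException("types must not contain null");
            }
        }
        return types;
    }
}
```

Failing fast client-side like this turns an opaque serialization NPE into an immediate, descriptive error at the call site.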
Re: NPE from server when using query + geo filter + sort
I just noticed I had a typo in my query, this is the query payload I'm executing if I run it in HTTP (which works) { "from" : 0, "size" : 10, "query" : { "term" : { "ug_context" : "c2d2d78a-69cc-11e4-b22e-81db7b9aa660__user__zzzconnzzzlikes" } }, "post_filter" : { "geo_distance" : { "go_location" : [ -122.40474700927734, 37.77427673339844 ], "distance" : "200.0m" } }, "sort" : [ { "su_created" : { "order" : "asc", "ignore_unmapped" : true } }, { "nu_created" : { "order" : "asc", "ignore_unmapped" : true } }, { "bu_created" : { "order" : "asc", "ignore_unmapped" : true } } ] } On Tuesday, November 11, 2014 10:11:46 AM UTC-7, Todd Nine wrote: > > Hi all, > I'm getting some strange behavior from the ES server when using a term > query + a geo distance filter + a sort. I've tried this with 1.3.2, > 1.3.5, as well as 1.4.0. All exhibit this same behavior. I'm using the > Java transport client. Here is my SearchRequestBuilder payload in > toString() format. > > > query { "from" : 0, "size" : 10, "query" : { "term" : { > "ug_context" : > "2abaf75a-69c4-11e4-918e-4658cde16dad__application__zzzcollzzzusers" } > }, "post_filter" : { "geo_distance" : { "go_location" : [ > -122.40474700927734, 37.77427673339844 ], "distance" : "200.0m" } > }, "sort" : [ { "su_created" : { "order" : "asc", > "ignore_unmapped" : true } }, { "nu_created" : { "order" : > "asc", "ignore_unmapped" : true } }, { "bu_created" : { > "order" : "asc", "ignore_unmapped" : true } } ] } limit 10 > > > Here is the stack trace > > org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: > Failed execution > at > org.elasticsearch.action.support.AdapterActionFuture.rethrowExecutionException(AdapterActionFuture.java:90) > at > org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:50) > at > org.apache.usergrid.persistence.index.impl.EsEntityIndexImpl.search(EsEntityIndexImpl.java:307) > at > 
org.apache.usergrid.corepersistence.CpRelationManager.searchConnectedEntities(CpRelationManager.java:1453) > at > org.apache.usergrid.corepersistence.CpEntityManager.searchConnectedEntities(CpEntityManager.java:1585) > at org.apache.usergrid.persistence.GeoIT.testGeo(GeoIT.java:202) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at org.apache.usergrid.CoreApplication$1.evaluate(CoreApplication.java:133) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) > at org.junit.runners.ParentRunner$2.evaluate(ParentR
NPE from server when using query + geo filter + sort
Hi all, I'm getting some strange behavior from the ES server when using a term query + a geo distance filter + a sort. I've tried this with 1.3.2, 1.3.5, as well as 1.4.0. All exhibit this same behavior. I'm using the Java transport client. Here is my SearchRequestBuilder payload in toString() format. query { "from" : 0, "size" : 10, "query" : { "term" : { "ug_context" : "2abaf75a-69c4-11e4-918e-4658cde16dad__application__zzzcollzzzusers" } }, "post_filter" : { "geo_distance" : { "go_location" : [ -122.40474700927734, 37.77427673339844 ], "distance" : "200.0m" } }, "sort" : [ { "su_created" : { "order" : "asc", "ignore_unmapped" : true } }, { "nu_created" : { "order" : "asc", "ignore_unmapped" : true } }, { "bu_created" : { "order" : "asc", "ignore_unmapped" : true } } ] } limit 10 Here is the stack trace org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: Failed execution at org.elasticsearch.action.support.AdapterActionFuture.rethrowExecutionException(AdapterActionFuture.java:90) at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:50) at org.apache.usergrid.persistence.index.impl.EsEntityIndexImpl.search(EsEntityIndexImpl.java:307) at org.apache.usergrid.corepersistence.CpRelationManager.searchConnectedEntities(CpRelationManager.java:1453) at org.apache.usergrid.corepersistence.CpEntityManager.searchConnectedEntities(CpEntityManager.java:1585) at org.apache.usergrid.persistence.GeoIT.testGeo(GeoIT.java:202) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.apache.usergrid.CoreApplication$1.evaluate(CoreApplication.java:133) at org.junit.rules.RunRules.evaluate(RunRules.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) at org.apache.usergrid.CoreITSetupImpl$1.evaluate(CoreITSetupImpl.java:66) at org.junit.rules.RunRules.evaluate(RunRules.java:20) at org.junit.runners.ParentRunner.run(ParentRunner.java:309) at org.junit.runner.JUnitCore.run(JUnitCore.java:160) at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74) at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211) at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67) Caused by: java.lang.NullPointerException at org.elasticsearch.common.io.stream.StreamOutput.writeString(StreamOutput.java:223) at org.elasticsearch.common.io.stream.HandlesStreamOutput.writeString(HandlesStreamOutput.java:55) at org.elasticsearch.common.io.stream.StreamOutput.writeStringArray(StreamOutput.java:298) at org.elasticsearch.action.search.SearchRequest.writeTo(SearchRequest.java:615) at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:685) at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:199) at 
org.elasticsearch.action.TransportActionNodeProxy.execute(TransportActionNodeProxy.java:57) at org.elasticsearch.client.transport.support.InternalTransportClient$1.doWithNode(InternalTransportClient.java:109) at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:202) at org.elasticsearch.client.transport.support.InternalTransportClient.execute(InternalTransportClient.java:106) at org.elasticsearch.client.support.AbstractClient.search(AbstractClient.java:334) at org.elasticsearch.client.transport.TransportClient.search(TransportClient.java:424) at org.elasticsearch.action.search.SearchRequestBuilder.doExecute(SearchRequestBuilder.java:1116) at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:91) at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:65) I've observed a couple of things. I don't
Help with a larger cluster in EC2 and Node clients failing to join
Hey guys, We're running some load tests, and we're finding some of our clients are having issues joining the cluster. We have the following setup running in EC2.

- 24 c3.4xlarge ES instances. These instances are our data storage and search instances.
- 40 c3.xlarge Tomcat instances. These are our ephemeral webapp instances.

In our Tomcat instances, we're using the Node java client. What I'm finding is that as the cluster comes up, only about 36~38 of the Tomcat nodes ever discover the 24 nodes that are actually storing data. We're using TCP (unicast) discovery, and have the following code in our client.

Settings settings = ImmutableSettings.settingsBuilder()
    .put( "cluster.name", fig.getClusterName() )
    // this assumes that we're using zen for host discovery.  Putting an
    // explicit set of bootstrap hosts ensures we connect to a valid cluster.
    .put( "discovery.zen.ping.unicast.hosts", allHosts )
    .put( "discovery.zen.ping.multicast.enabled", "false" )
    .put( "http.enabled", false )
    .put( "client.transport.ping_timeout", 2000 ) // milliseconds
    .put( "client.transport.nodes_sampler_interval", 100 )
    .put( "network.tcp.blocking", true )
    .put( "node.client", true )
    .put( "node.name", nodeName )
    .build();

log.debug( "Creating ElasticSearch client with settings: " + settings.getAsMap() );

where allHosts = "ip1:port, ip2:port", etc. Not sure what we're doing wrong. We have retry code so that if a client fails 20 consecutive times with a TransportException, the node instance is stopped and a new instance is created. This has worked for dealing with network hiccups in the past; however, we can never seem to get past the 38-client mark. Any ideas what we may be doing wrong? Thanks, Todd
Help with profiling our code's usage of the Node java client using YourKit
Hi All, I've been using ES for a while and I'm really enjoying it, but we have a few slow calls in our code. I have a few hunches about the code that's using the client inefficiently, but I would like some definitive proof. I've attempted to profile our application using YourKit while we're load testing, and I've found we're spending 30-35% of our "wall time" in the Netty classes of the ES Node client. However, I can't tell where the call is originating from in our source, since the calls are asynchronous and I can't capture the stack. Are there any debugging options I can turn on in the client that will help me with profiling, so that I can track down the code in our system that's not using the client efficiently? Any help would be greatly appreciated! Thanks, Todd
Re: Help with designing our document for graphs. Indexing single nodes in graph with thousands of incoming edges
Hey Jorg, Thanks for your response, it's very helpful. We're taking a similar approach now, however it is slightly different. We're using Cassandra as our edge storage, and we won't be moving to Elasticsearch for a bit. We're very operationally familiar with Cassandra, and Elasticsearch is a new beast to us. We're only going to use it for secondary indexing until we become more comfortable with it, then we're going to transition more use cases to it. In our current implementation, entities contained within types of edges are indexed by target type and edge type into Elasticsearch. In the example I gave above, we actually index 3 documents, all of which are the same except for the type, to make seeks more efficient.

{ docId: e1-v1, name: "duo", openTime: 9, closeTime: 18, _type: "george/likes" }

{ docId: e1-v1, name: "duo", openTime: 9, closeTime: 18, _type: "dave/likes" }

{ docId: e1-v1, name: "duo", openTime: 9, closeTime: 18, _type: "rod/likes" }

We then search within the type "dave/likes" for all restaurants Dave likes. We end up with a lot more documents this way, but we won't hit the document size issues we've discussed. What sort of recommendations do you have for shard size? Right now we're just sticking with the default 10, with a replica count of 2, and we seem to be doing well. Ultimately we're going to change our client to point to an alias, and add more indexes behind the alias as we expand. Most apps will never need more than 10 shards; a handful will need to expand into a few indexes. Thoughts on this implementation? Thanks, Todd

On Tue, Oct 7, 2014 at 5:19 AM, joergpra...@gmail.com wrote:
> Unfortunately, adding edges per update script will soon become expensive. Note, updating is a multistep process of reading the doc, looking up the field (often by fetching _source), and reindexing the whole(!) document (not only the new edge) plus the versioning conflict management in case you run concurrent updates.
Also, this is the same procedure for removing an > edge. This is a huge difference to graph algorithms, where it is very cheap > to add/remove edges. Script updates will work to a certain extent quite > satisfactory, but you are right, if you want to add millions of edges to an > ES doc one by one, this will not be efficient. > > So I would like to suggest to avoid the overhead of updating fields by > script in preference to add / remove relations by their "relation id", i.e. > to treat relations as first citizen docs. Adding millions of docs to an ES > index is cheaper than a million scripted updates on a single field. > > Jörg > > > On Tue, Oct 7, 2014 at 1:23 AM, Todd Nine wrote: > >> Hi Jorg, >> Thanks for the response. I don't actually need to model the >> relationship per se, more that a document is used in a relationship via a >> filter, then search on it's properties. See the example below for more >> clarity. >> >> >> Restaurant: => {name: "duo"} >> >> Now, lets say I have 3 users, >> >> George, Dave and Rod >> >> George Dave and Rod all "like" the restaurant Duo. These are directed >> edges from the user, of type "likes" to the "duo" document. We store these >> edges in Cassandra. Envision the document looking something like this. >> >> >> { >> name: "duo", >> openTime: 9, >> closeTime: 18 >> _in_edges: [ "george/likes", "dave/likes", "rod/likes" ] >> } >> >> Then when searching, the user Dave would search something like this. >> >> select * where closeTime < 16 >> >> >> Which we translate in to a query, which is then also filtered by >> _in_edges = "dave/likes". >> >> Our goal is to only create 1 document per node in our graph (in this >> example restaurant), then possibly use the scripting API to add and remove >> elements to the _in_edges fields and update the document. My only concern >> around this is document size. 
It's not clear to me how to go about this >> when we start getting millions of edges to that same target node, or >> _in_edges field could grow to be millions of fields long. At that point, >> is it more efficient to de-normalize and just turn "dave/likes", >> "rod/likes", and "george/likes" into document types and store multiple >> copies? >> >> Thanks, >> Todd >> >> >> >> >> >> >> >> >> On Sat, Oct 4, 2014 at 2:52 AM, joergpra...@gmail.com <
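Jörg's "relations as first-class docs" suggestion above would turn each edge into its own small document rather than an ever-growing `_in_edges` array on the node. A rough shape, with index, type, and field names invented for illustration:

```json
{
  "_index" : "graph",
  "_type" : "relations",
  "_source" : {
    "source" : "dave",
    "relation" : "likes",
    "target" : "restaurant:duo"
  }
}
```

With this layout, adding or removing an edge is a single cheap index or delete of a tiny document, instead of a scripted update that has to fetch and reindex the whole node document, and the node document itself stays a constant size no matter how many incoming edges it accumulates.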
Re: Help with designing our document for graphs. Indexing single nodes in graph with thousands of incoming edges
Hi Jorg, Thanks for the response. I don't actually need to model the relationship per se; rather, a document is reached through a relationship via a filter, and then I search on its properties. See the example below for more clarity.

Restaurant: => {name: "duo"}

Now, let's say I have 3 users: George, Dave, and Rod. George, Dave, and Rod all "like" the restaurant Duo. These are directed edges from the user, of type "likes", to the "duo" document. We store these edges in Cassandra. Envision the document looking something like this.

{ name: "duo", openTime: 9, closeTime: 18, _in_edges: [ "george/likes", "dave/likes", "rod/likes" ] }

Then when searching, the user Dave would search something like this.

select * where closeTime < 16

Which we translate into a query, which is then also filtered by _in_edges = "dave/likes".

Our goal is to create only 1 document per node in our graph (in this example, the restaurant), then possibly use the scripting API to add and remove elements in the _in_edges field and update the document. My only concern around this is document size. It's not clear to me how to go about this when we start getting millions of edges to that same target node, as the _in_edges field could grow to be millions of elements long. At that point, is it more efficient to de-normalize and just turn "dave/likes", "rod/likes", and "george/likes" into document types and store multiple copies? Thanks, Todd

On Sat, Oct 4, 2014 at 2:52 AM, joergpra...@gmail.com wrote:
> Not sure if this helps, but I use a variant of graphs in ES, called Linked Data (JSON-LD).
>
> By using JSON-LD, you can index something like
>
> doc index: graph
> doc type: relations
> doc id: ...
>
> { "user" : { "id" : "...", "label" : "Bob", "likes" : "restaurant:Duo" } }
>
> for the statement "Bob likes restaurant Duo"
>
> and then you can run ES queries on the field "likes" or better "user.likes" for finding the users that like a restaurant etc.
Referencing > the "id", it is possible to look up another document in another index about > "Bob". > > Just to give an idea how you can model relations in structured ES JSON > objects. > > Jörg > > > On Fri, Oct 3, 2014 at 7:59 PM, Todd Nine wrote: > >> So clearly I need to RTFM. I missed this in the documentation the first >> time. >> >> >> http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/mapping.html#_how_types_are_implemented >> >> Will filters at this scale be fast enough? >> >> >> >> On Friday, October 3, 2014 11:48:40 AM UTC-6, Todd Nine wrote: >>> >>> Hey guys, >>> We're currently storing entities and edges in Cassandra. The entities >>> are JSON, and edges are directed edges with a source---type-->target. >>> We're using ElasticSearch for indexing and I could really use a hand with >>> design. >>> >>> What we're doing currently is we take an entity and turn its JSON >>> into a document. We then create multiple copies of our document and change >>> its type to match the index. For instance, imagine the following use case. >>> >>> >>> bob(user) -- likes --> Duo (restaurant) ===> Document Type = >>> bob(user) + likes + restaurant ; bob(user) + likes >>> >>> >>> bob(user) -- likes --> Root Down (restaurant) ===> Document Type = >>> bob(user) + likes + restaurant ; bob(user) + likes >>> >>> bob(user) -- likes --> Coconut Porter (beer) ===> Document Types = >>> bob(user) + likes + beer; bob(user) + likes >>> >>> >>> When we index using this scheme we create 3 documents based on the >>> restaurants Duo and Root Down, and the beer Coconut Porter. We then store >>> this document 2x, one for its specific type, and one in the "all" bucket. >>> >>> Essentially, the document becomes a node in the graph. For each >>> incoming directed edge, we're storing 2x documents and changing the type. >>> This gives us fast seeks when we search by type, but a LOT of data bloat. 
>>> Would it instead be more efficient to keep an array of incoming edges in >>> the document, then add it to our search terms? For instance, should we >>> instead have a document like this? >>> >>> >>> docId: Duo(restaurant) >>> >>
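The edge-filter query described in this thread can be sketched as a plain dict builder (field names `_in_edges` and `closeTime` and the `dave/likes` edge key are taken from the example document above; the `post_filter` shape assumes the pre-2.x query syntax the thread is using):

```python
# Sketch: build the search body for "Dave searches restaurants closing
# before 16", restricted to documents Dave has an incoming "likes" edge to.
def edge_filtered_query(user_edge, close_before):
    """Range query on closeTime, post-filtered by an incoming-edge term."""
    return {
        "query": {
            "range": {"closeTime": {"lt": close_before}}
        },
        # term filter on the denormalized incoming-edge array;
        # a term against an array field matches if any element matches
        "post_filter": {
            "term": {"_in_edges": user_edge}
        },
    }

q = edge_filtered_query("dave/likes", 16)
```

The body would then be POSTed to the application's search endpoint; only the dict construction is shown here, since the interesting part is how the `_in_edges` array turns the graph edge into an ordinary term filter.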
Re: Help with designing our document for graphs. Indexing single nodes in graph with thousands of incoming edges
So clearly I need to RTFM. I missed this in the documentation the first time. http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/mapping.html#_how_types_are_implemented Will filters at this scale be fast enough? On Friday, October 3, 2014 11:48:40 AM UTC-6, Todd Nine wrote: > > Hey guys, > We're currently storing entities and edges in Cassandra. The entities > are JSON, and edges are directed edges with a source---type-->target. > We're using ElasticSearch for indexing and I could really use a hand with > design. > > What we're doing currently is we take an entity and turn its JSON into > a document. We then create multiple copies of our document and change its > type to match the index. For instance, imagine the following use case. > > > bob(user) -- likes --> Duo (restaurant) ===> Document Type = bob(user) > + likes + restaurant ; bob(user) + likes > > > bob(user) -- likes --> Root Down (restaurant) ===> Document Type = > bob(user) + likes + restaurant ; bob(user) + likes > > bob(user) -- likes --> Coconut Porter (beer) ===> Document Types = > bob(user) + likes + beer; bob(user) + likes > > > When we index using this scheme we create 3 documents based on the > restaurants Duo and Root Down, and the beer Coconut Porter. We then store > this document 2x, one for its specific type, and one in the "all" bucket. > > Essentially, the document becomes a node in the graph. For each incoming > directed edge, we're storing 2x documents and changing the type. This > gives us fast seeks when we search by type, but a LOT of data bloat. Would > it instead be more efficient to keep an array of incoming edges in the > document, then add it to our search terms? For instance, should we instead > have a document like this? > > > docId: Duo(restaurant) > > edges: [ "bob(user) + likes + restaurant", "bob(user) + likes" ] > > When searching where edges = "bob(user) + likes + restaurant"? 
> > > I don't know internally what specifying type actually does, whether it just > treats it as a field, or if it changes the routing of the response? In > a social situation millions of people can be connected to any one entity, > so we have to have a scheme that won't fall over when we get to that case. > > Any help would be greatly appreciated! > > Thanks, > Todd > -- You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f97c6475-f4fc-4078-b052-b497ac82dc91%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Help with designing our document for graphs. Indexing single nodes in graph with thousands of incoming edges
Hey guys, We're currently storing entities and edges in Cassandra. The entities are JSON, and edges are directed edges with a source---type-->target. We're using ElasticSearch for indexing and I could really use a hand with design. What we're doing currently is we take an entity and turn its JSON into a document. We then create multiple copies of our document and change its type to match the index. For instance, imagine the following use case. bob(user) -- likes --> Duo (restaurant) ===> Document Type = bob(user) + likes + restaurant ; bob(user) + likes bob(user) -- likes --> Root Down (restaurant) ===> Document Type = bob(user) + likes + restaurant ; bob(user) + likes bob(user) -- likes --> Coconut Porter (beer) ===> Document Types = bob(user) + likes + beer; bob(user) + likes When we index using this scheme we create 3 documents based on the restaurants Duo and Root Down, and the beer Coconut Porter. We then store this document 2x, one for its specific type, and one in the "all" bucket. Essentially, the document becomes a node in the graph. For each incoming directed edge, we're storing 2x documents and changing the type. This gives us fast seeks when we search by type, but a LOT of data bloat. Would it instead be more efficient to keep an array of incoming edges in the document, then add it to our search terms? For instance, should we instead have a document like this? docId: Duo(restaurant) edges: [ "bob(user) + likes + restaurant", "bob(user) + likes" ] When searching where edges = "bob(user) + likes + restaurant"? I don't know internally what specifying type actually does, whether it just treats it as a field, or if it changes the routing of the response? In a social situation millions of people can be connected to any one entity, so we have to have a scheme that won't fall over when we get to that case. Any help would be greatly appreciated! Thanks, Todd 
Re: Upper bounds on the number of indexes in an elastic search cluster
It sounds like we're going to need to test our upper bounds of indexes (with no data) to see how many we can support. We may need to re-evaluate our thoughts on an index per app. We might be better off doing a statically sized set of indexes, then consistently hashing our applications to those indexes. Thank you for your help! On Fri, Sep 26, 2014 at 5:44 PM, joergpra...@gmail.com < joergpra...@gmail.com> wrote: > If you consider tens of thousands of indices on tens of thousands of > nodes, and the master node is the only node that can write to the cluster > state, it will have a lot of work to do to keep up with all cluster state > updates. > > When the rate of changes to the cluster state increases, the master node > will be challenged to propagate the state changes to all other nodes > reliably and fast enough. There is no "distributed tree" yet, for example, > there are no special "forwarder nodes" that communicate the cluster state > to a partitioned set of nodes. > > See https://github.com/elasticsearch/elasticsearch/issues/6186 > > The cluster state is a big compressed JSON structure which also must fit > into the heap memory of master-eligible nodes. ES also uses privileged > network traffic channels for cluster state communication to take precedence > over ordinary index/search messaging. But all these precautions may not be > enough at some point. You can observe the point by retrieving a growing > cluster state over the cluster state API and measuring the size and time of > this request. > > On the other hand, you can have a calm cluster state and many thousand > indices when the type mappings are constant, no field updates occur, and no > nodes connect/disconnect. It all depends on the situation how you have to > operate the cluster. One important thing is to allocate enough resources to > master-eligible data-less nodes so they are not hindered by extra > search/index load. > > N.B. 20 nodes is not a big cluster. 
There are ES clusters of hundreds and > thousands of nodes. From my understanding, the communication of the master > and 20 nodes is not a serious problem. This becomes an issue at ~500-1000 > nodes. > > Jörg > > > > > On Sat, Sep 27, 2014 at 1:12 AM, Todd Nine wrote: > >> Hi Jorg, >> We're storing each application in its own index so we can manage it >> independently of others. There's no set load or usage profile for our >> applications. Some will be very small, a few hundred documents. Others >> will be quite large, in the billions. We have no way of knowing what the >> usage profile is. Rather, we're initially thinking that expansion will >> occur using a combination of additional indexes and aliases referencing >> those indexes. This allows us to automate the management of the aliases >> and indexes, and in turn allows us to scale them to the needs of the >> application without over-allocating unused capacity. For instance, with >> write-heavy applications we can allocate more shards (via alias); for read->> heavy applications we can allocate more replicas. >> >> >> >> We're not running our cluster on a single node. Our cluster is small to >> begin with, it's 6 nodes in our current POC. Ultimately I expect us to >> grow each cluster we stand up to 20 or so nodes. We'll expand as necessary >> to support the number of shards and replicas and keep our performance up. >> I'm not particularly worried about our ability to scale horizontally with >> our hardware. >> >> Rather, I'm concerned with how far we can scale on our number of indexes, >> and how does that relate to the number of machines? When we keep adding >> hardware, does this increase the upper bounds of the number of indexes we >> can have? Not the physical shards and replicas, but the routing >> information for the master of the shards and location of the replicas. >> I've done distributed data storage for many years, and none of the >> documentation on ES makes it clear if this becomes an issue operationally. 
>> I'm leery to just assume it will "just work". When implementing something >> like this, you either have to do a distributed tree for your meta data to >> get the partitioning you need to scale infinitely, or every node must store >> every shard's master information. How does it work in ES? >> >> >> Thanks, >> Todd >> >> >> >> On Friday, September 26, 2014 4:29:53 PM UTC-6, Jörg Prante wrote: >>> >>> Why do you want to create huge number of indexes on just a single node? >>> >>> There are smarter method
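The "statically sized set of indexes, then consistently hashing our applications to those indexes" idea from the reply above can be sketched in a few lines (the index naming scheme and the pool size of 64 are made-up values for illustration; with a fixed bucket count that never resizes, a stable hash modulo the pool size is enough):

```python
import hashlib

NUM_INDEXES = 64  # fixed pool size, chosen up front (assumption)

def index_for_app(app_id: str) -> str:
    """Map an application id to one of a fixed set of shared indexes.

    Uses a stable digest (md5) rather than Python's built-in hash(),
    which is randomized per process, so the mapping survives restarts
    and is identical on every node.
    """
    digest = hashlib.md5(app_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % NUM_INDEXES
    return "apps-%03d" % bucket
```

Every client computes the same index name for a given application, so no shared lookup table is needed; the trade-off, as discussed later in the thread, is that resizing the pool would remap applications unless a true consistent-hash ring is used instead.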
Re: Upper bounds on the number of indexes in an elastic search cluster
Hi Jorg, We're storing each application in its own index so we can manage it independently of others. There's no set load or usage profile for our applications. Some will be very small, a few hundred documents. Others will be quite large, in the billions. We have no way of knowing what the usage profile is. Rather, we're initially thinking that expansion will occur using a combination of additional indexes and aliases referencing those indexes. This allows us to automate the management of the aliases and indexes, and in turn allows us to scale them to the needs of the application without over-allocating unused capacity. For instance, with write-heavy applications we can allocate more shards (via alias); for read-heavy applications we can allocate more replicas. We're not running our cluster on a single node. Our cluster is small to begin with, it's 6 nodes in our current POC. Ultimately I expect us to grow each cluster we stand up to 20 or so nodes. We'll expand as necessary to support the number of shards and replicas and keep our performance up. I'm not particularly worried about our ability to scale horizontally with our hardware. Rather, I'm concerned with how far we can scale on our number of indexes, and how does that relate to the number of machines? When we keep adding hardware, does this increase the upper bounds of the number of indexes we can have? Not the physical shards and replicas, but the routing information for the master of the shards and location of the replicas. I've done distributed data storage for many years, and none of the documentation on ES makes it clear if this becomes an issue operationally. I'm leery to just assume it will "just work". When implementing something like this, you either have to do a distributed tree for your meta data to get the partitioning you need to scale infinitely, or every node must store every shard's master information. How does it work in ES? 
Thanks, Todd On Friday, September 26, 2014 4:29:53 PM UTC-6, Jörg Prante wrote: > > Why do you want to create a huge number of indexes on just a single node? > > There are smarter methods to scale. Use over-allocation of shards. This is > explained by kimchy in this thread > > > http://elasticsearch-users.115913.n3.nabble.com/Over-allocation-of-shards-td3673978.html > > TL;DR you can create many thousands of aliases on a single (or few) > indices with just a few shards. There is no limit defined by ES; when your > configuration / hardware capacity is exceeded, you will see the node > getting sluggish. > > Jörg > > On Fri, Sep 26, 2014 at 11:23 PM, Todd Nine > wrote: > >> Hey guys. We're building a multi-tenant application, where users create >> applications within our single server. For our current ES scheme, we're >> building an index per application. Are there any stress tests or >> documentation on the upper bounds of the number of indexes a cluster can >> handle? From my current understanding of meta data and routing, every node >> caches the meta data of all the indexes and shards for routing. At some >> point, this will obviously overwhelm the node. Is my current understanding >> correct, or is this information partitioned across the cluster as well? >> >> >> Thanks, >> >> Todd >> 
Upper bounds on the number of indexes in an elastic search cluster
Hey guys. We’re building a multi-tenant application, where users create applications within our single server. For our current ES scheme, we're building an index per application. Are there any stress tests or documentation on the upper bounds of the number of indexes a cluster can handle? From my current understanding of meta data and routing, every node caches the meta data of all the indexes and shards for routing. At some point, this will obviously overwhelm the node. Is my current understanding correct, or is this information partitioned across the cluster as well? Thanks, Todd 
Re: Shard count and plugin questions
Hey Mark, Thanks for this. It seems like our best bet will be to manage indexes the same across all regions, since they're really mirrors. Since our documents are immutable, we'll just queue them up for each region, and each region will apply the inserts and deletes to its own index. It's the only solution I can think of, since we want to continue processing when WAN connections go down. On Tue, Jun 10, 2014 at 3:24 PM, Mark Walkom wrote: > There are a few people in the IRC channel that have done it; however, > generally, cross-WAN clusters are not recommended as ES is sensitive to > latency. > > You may be better off using the snapshot/restore process, or another > export/import method. > > Regards, > Mark Walkom > > Infrastructure Engineer > Campaign Monitor > email: ma...@campaignmonitor.com > web: www.campaignmonitor.com > > > On 11 June 2014 03:11, Todd Nine wrote: > >> Hey guys, >> One last question. Does anyone do multi-region replication with ES? >> My current understanding is that with a multi-region cluster, documents >> will be routed to the region with a node that "owns" the shard the document >> is being written to. In our use cases, our cluster must survive a WAN >> outage. We don't want the latency of the writes or reads crossing the WAN >> connection. Our documents are immutable, so we can work with multi-region >> writes. We simply need to replicate the write to other regions, as well as >> the deletes. Are there any examples or implementations of this? >> >> Thanks, >> Todd >> >> On Thursday, June 5, 2014 4:11:44 PM UTC-6, Jörg Prante wrote: >>> >>> Yes, routing is very powerful. The general use case is to introduce a >>> mapping to a large number of shards so you can store parts of data all at >>> the same shard which is good for locality concepts. For example, combined >>> with index alias working on filter terms, you can create one big concrete >>> index, and use segments of it, so many thousands of users can share a >>> single index. 
>>> >>> Another use case for routing might be a time-windowed index where each >>> shard holds a time window. There are many examples around logstash. >>> >>> The combination of index alias and routing is also known as "shard >>> overallocation". The concept might look complex at first, but the other option >>> would be to create a concrete index for every user, which might also be a >>> waste of resources. >>> >>> Though some here on this list have managed to run 1000s of shards on a >>> single node, I still find this breathtaking - a few dozen shards per >>> node should be ok. Each shard takes some MB on the heap (there are tricks >>> to reduce this a bit), but a high number of shards takes a handful of >>> resources even without executing a single query. There might be other >>> factors worth considering, for example a size limit for a single shard. It >>> can be quite handy to let ES move around shards of 1-10 GB instead >>> of a few 100 GB - it is faster at index recovery or at reallocation time. >>> >>> Jörg >>> >>> >>> >>> >>> >>> >>> >>> On Thu, Jun 5, 2014 at 9:44 PM, Todd Nine wrote: >>> >>>> Hey Jorg, >>>> Thanks for the reply. We're using Cassandra heavily in production; >>>> I'm very familiar with the scale-out concepts. What we've seen in all >>>> our distributed systems is that at some point, you reach a saturation of >>>> your capacity for a single node. In the case of ES, to me that would seem >>>> to be shard count. Eventually, all 5 shards can become too large for a >>>> node to handle updates and reads efficiently. This can be caused by a high >>>> number of documents or document size, or both. Once we reach this state, >>>> that index is "full" in the sense that the nodes containing these can no >>>> longer continue to service traffic at the rate we need it to. We have 2 >>>> options. >>>> >>>> 1) Get bigger hardware. We do this occasionally, but it's not ideal since >>>> this is a distributed system. >>>> >>>> 2) Scale out, as you said. 
In the case of write throughput it seems >>>> that we can do this with a pattern of alias + new index, but it's not clear >>>> to me if that's the right approach. My initial thinking is to define some >>>> sort of routing that
Re: Shard count and plugin questions
Hey guys, One last question. Does anyone do multi-region replication with ES? My current understanding is that with a multi-region cluster, documents will be routed to the region with a node that "owns" the shard the document is being written to. In our use cases, our cluster must survive a WAN outage. We don't want the latency of the writes or reads crossing the WAN connection. Our documents are immutable, so we can work with multi-region writes. We simply need to replicate the write to other regions, as well as the deletes. Are there any examples or implementations of this? Thanks, Todd On Thursday, June 5, 2014 4:11:44 PM UTC-6, Jörg Prante wrote: > > Yes, routing is very powerful. The general use case is to introduce a > mapping to a large number of shards so you can store parts of data all at > the same shard which is good for locality concepts. For example, combined > with index alias working on filter terms, you can create one big concrete > index, and use segments of it, so many thousands of users can share a > single index. > > Another use case for routing might be a time-windowed index where each > shard holds a time window. There are many examples around logstash. > > The combination of index alias and routing is also known as "shard > overallocation". The concept might look complex at first, but the other option > would be to create a concrete index for every user, which might also be a > waste of resources. > > Though some here on this list have managed to run 1000s of shards on a > single node, I still find this breathtaking - a few dozen shards per > node should be ok. Each shard takes some MB on the heap (there are tricks > to reduce this a bit), but a high number of shards takes a handful of > resources even without executing a single query. There might be other > factors worth considering, for example a size limit for a single shard. 
It > can be quite handy to let ES move around shards of 1-10 GB instead > of a few 100 GB - it is faster at index recovery or at reallocation time. > > Jörg > > > > > > > > On Thu, Jun 5, 2014 at 9:44 PM, Todd Nine > > wrote: > >> Hey Jorg, >> Thanks for the reply. We're using Cassandra heavily in production; I'm >> very familiar with the scale-out concepts. What we've seen in all our >> distributed systems is that at some point, you reach a saturation of your >> capacity for a single node. In the case of ES, to me that would seem to be >> shard count. Eventually, all 5 shards can become too large for a node to >> handle updates and reads efficiently. This can be caused by a high number >> of documents or document size, or both. Once we reach this state, that >> index is "full" in the sense that the nodes containing these can no longer >> continue to service traffic at the rate we need it to. We have 2 options. >> >> 1) Get bigger hardware. We do this occasionally, but it's not ideal since >> this is a distributed system. >> >> 2) Scale out, as you said. In the case of write throughput it seems that >> we can do this with a pattern of alias + new index, but it's not clear to >> me if that's the right approach. My initial thinking is to define some >> sort of routing that pivots on created date to the new index since that's >> an immutable field. Thoughts? >> >> In the case of read throughput, we can create more replicas. Our system >> is about 50/50 now; some users are even read/write, others are very read >> heavy. I'll probably come up with 2 indexing strategies we can apply to an >> application's index based on the heuristics from the operations they're >> performing. >> >> >> Thanks for the feedback! >> Todd >> >> >> >> >> On Thu, Jun 5, 2014 at 10:55 AM, joerg...@gmail.com < >> joerg...@gmail.com > wrote: >> >>> Thanks for raising the questions, I will come back later in more detail. 
>>> >>> Just a quick note, the idea about "shards scale write" and "replica >>> scale read" is correct, but Elasticsearch is also "elastic" which means it >>> "scales out", by adding node hardware. The shard/replica scale pattern >>> finds its limits in a node hardware, because shards/replica are tied to >>> machines, and there are the hard resource constraints, mostly disk I/O and >>> memory related. >>> >>> In the end, you can take as a rule of thumb: >>> >>> - add replica to scale "read" load >>> - add new
Re: Shard count and plugin questions
Hey Jorg, Thanks for the reply. We're using Cassandra heavily in production; I'm very familiar with the scale-out concepts. What we've seen in all our distributed systems is that at some point, you reach a saturation of your capacity for a single node. In the case of ES, to me that would seem to be shard count. Eventually, all 5 shards can become too large for a node to handle updates and reads efficiently. This can be caused by a high number of documents or document size, or both. Once we reach this state, that index is "full" in the sense that the nodes containing these can no longer continue to service traffic at the rate we need it to. We have 2 options. 1) Get bigger hardware. We do this occasionally, but it's not ideal since this is a distributed system. 2) Scale out, as you said. In the case of write throughput it seems that we can do this with a pattern of alias + new index, but it's not clear to me if that's the right approach. My initial thinking is to define some sort of routing that pivots on created date to the new index since that's an immutable field. Thoughts? In the case of read throughput, we can create more replicas. Our system is about 50/50 now; some users are even read/write, others are very read heavy. I'll probably come up with 2 indexing strategies we can apply to an application's index based on the heuristics from the operations they're performing. Thanks for the feedback! Todd On Thu, Jun 5, 2014 at 10:55 AM, joergpra...@gmail.com < joergpra...@gmail.com> wrote: > Thanks for raising the questions, I will come back later in more detail. > > Just a quick note, the idea about "shards scale write" and "replica scale > read" is correct, but Elasticsearch is also "elastic", which means it > "scales out" by adding node hardware. The shard/replica scale pattern > finds its limits in node hardware, because shards/replica are tied to > machines, and there are the hard resource constraints, mostly disk I/O and > memory related. 
> > In the end, you can take as a rule of thumb: > > - add replicas to scale "read" load > - add new indices (i.e. new shards) to scale "write" load > - and add nodes to scale out the whole cluster for both read and write load > > More later, > > Jörg > > > > > > On Thu, Jun 5, 2014 at 7:17 PM, Todd Nine wrote: > >> Hey Jörg, >> Thank you for your response. A few questions/points. >> >> In our use cases, the inability to write or read is considered >> downtime. Therefore, I cannot disable writes during expansion. Your alias >> points raise >> some interesting research I need to do, and I have a few follow-up >> questions. >> >> >> >> Our systems are fully multi-tenant. We currently intend to have 1 index >> per application. Each application can have a large number of types within >> their index. Each cluster could potentially hold 1000 or more >> applications/indexes. Most users will never need more than 5 shards. >> Some users are huge power users, billions of documents for a type with >> several large types. These are the users I am concerned with. >> >> From my understanding and experimentation, ElasticSearch has 2 primary >> mechanisms for tuning performance to handle the load. First is the shard >> count. The higher the shard count, the more writes you can accept. Each >> shard has a master which accepts the write, and replicates the write to >> its replicas. For high write throughput, you increase the count of shards >> to distribute the load across more nodes. For read throughput, you >> increase the replica count. This gives you higher performance on read, >> since you now have more than 1 node per shard you can query to get results. >> >> >> >> Per your suggestion, rather than copy the documents from an index >> with 5 shards to an index with 10 shards, I can theoretically create a new >> index then add it to the alias. For instance, I envision this in the following >> way. 
>> >> >> Aliases: >> app1-read >> app1-write >> >> >> Initial creation: >> >> app1-read -> Index: App1-index1 (5 shards, 2 replicas) >> app1-write -> Index: App1-index1 >> >> User begins to have too much data in App1-index1. The 5 shards are >> causing hotspots. The following actions take place. >> >> 1) Create App1-index2 (5 shards, 2 replicas) >> >> 2) Update app1-read -> App1-index1 and App1-index2 >> >> 3) Update app1-write -> App1-index1 and App1-index2 >> >> >> I have some uncertainty around
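The alias moves in the steps above map onto the `_aliases` endpoint, which takes a list of add/remove actions. A minimal sketch of building the actions body for adding a second index behind an existing alias (the alias and index names come from the example; only the dict construction is shown, not the HTTP call):

```python
def add_index_to_alias(alias: str, new_index: str) -> dict:
    """Build an _aliases actions body that adds new_index behind alias.

    Indexes already attached to the alias are untouched, so after this
    is POSTed to /_aliases the alias spans both the old and new index.
    """
    return {
        "actions": [
            {"add": {"index": new_index, "alias": alias}}
        ]
    }

body = add_index_to_alias("app1-read", "App1-index2")
```

A single `_aliases` call can carry several actions (e.g. an `add` for the write alias and later a `remove` for the old index), and the whole batch is applied atomically, which is what makes the cutover steps above safe to script.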
Re: Shard count and plugin questions
der > managing a bunch of indices. If an old index is too small, just start a new > one which is bigger and has more shards and spans more nodes, and add them > to the existing set of indices. With index alias you can combine many > indices into one index name. This is very powerful. > > If you can not estimate the data growth rate, I recommend also to use a > reasonable number of shards from the very start. Say, if you expect 50 > servers to run an ES node on, then simply start with 50 shards on a small > number of servers, and add servers over time. You won't have to bother > about shard count for a very long time if you choose such a strategy. > > Do not think about rivers, they are not built for such use cases. Rivers > are designed as a "play tool" for fetching data quickly from external > sources, for demo purposes. They are discouraged for serious production use; > they are not very reliable if they run unattended. > > Jörg > > > On Thu, Jun 5, 2014 at 7:33 AM, Todd Nine wrote: > >> >> >> >>> 2) https://github.com/jprante/elasticsearch-knapsack might do what you >>> want. >>> >> >> This won't quite work for us. We can't have any downtime, so it seems >> like an A/B system is more appropriate. What we're currently thinking is >> the following. >> >> Each index has 2 aliases, a read and a write alias. >> >> 1) Both read and write aliases point to an initial index. Say shard count >> 5, replication 2 (ES is not our canonical data source, so we're ok with >> reconstructing search data) >> >> 2) We detect via monitoring we're going to outgrow an index. We create a >> new index with more shards, and potentially a higher replication depending >> on read load. We then update the write alias to point to both the old and >> new index. All clients will then begin dual writes to both indexes. >> >> 3) While we're writing to old and new, some process (maybe a river?) will >> begin copying documents updated < the write alias time from the old index >> to the new index. 
Ideally, it would be nice if each replica could copy >> only it's local documents into the new index. We'll want to throttle this >> as well. Each node will need additional operational capacity >> to accommodate the dual writes as well as accepting the write of the "old" >> documents. I'm concerned if we push this through too fast, we could cause >> interruptions of service. >> >> >> 4) Once the copy is completed, the read index is moved to the new index, >> then the old index is removed from the system. >> >> Could such a process be implemented as a plugin? If the work can happen >> in parallel across all nodes containing a shard we can increase the >> process's speed dramatically. If we have a single worker, like a river, it >> might possibly take too long. >> >> -- > You received this message because you are subscribed to a topic in the > Google Groups "elasticsearch" group. > To unsubscribe from this topic, visit > https://groups.google.com/d/topic/elasticsearch/4qO5BZSxWhc/unsubscribe. > To unsubscribe from this group and all its topics, send an email to > elasticsearch+unsubscr...@googlegroups.com. > To view this discussion on the web visit > https://groups.google.com/d/msgid/elasticsearch/CAKdsXoGcxRGjZr%3DS9%3DSaOZYddWyNxjvFNJF41T_hHe7inoZOwQ%40mail.gmail.com > <https://groups.google.com/d/msgid/elasticsearch/CAKdsXoGcxRGjZr%3DS9%3DSaOZYddWyNxjvFNJF41T_hHe7inoZOwQ%40mail.gmail.com?utm_medium=email&utm_source=footer> > . > > For more options, visit https://groups.google.com/d/optout. > -- You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CA%2Byzqf_wMHGZcK%2Bxozm%3Dxz4zh5prKmuob%3DBbXKzrHoVgiOk1wA%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.
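The alias cutover Todd outlines in steps 1–4 maps onto the `_aliases` endpoint, which applies its add/remove actions atomically. A minimal sketch of the two request bodies (the index and alias names are hypothetical placeholders, not from this thread; the endpoint itself is standard ES):

```python
import json

# Step 2: point the write alias at both the old and the new index,
# so clients dual-write during the backfill.
dual_write = {"actions": [
    {"add": {"index": "events_v1", "alias": "events_write"}},
    {"add": {"index": "events_v2", "alias": "events_write"}},
]}

# Step 4: after the copy completes, move reads to the new index and
# drop the old index from both aliases in one atomic request.
cutover = {"actions": [
    {"remove": {"index": "events_v1", "alias": "events_read"}},
    {"remove": {"index": "events_v1", "alias": "events_write"}},
    {"add": {"index": "events_v2", "alias": "events_read"}},
]}

# Each body would be POSTed to the cluster, e.g.:
#   curl -XPOST 'localhost:9200/_aliases' -d '...'
print(json.dumps(dual_write))
print(json.dumps(cutover))
```

Because all actions in one `_aliases` call take effect atomically, readers never observe a moment where the alias points at no index.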
Re: Shard count and plugin questions
Thanks for the feedback, Mark. I agree with your thoughts on the testing. We plan on doing some testing, finding our failure point, and dialing that back to some value that still allows us to run the migration. This way we can get ahead of the problem. Since a re-index would actually introduce more load on the system, we want to perform it before we start to see too much of a performance problem on any given index.

I wanted to clarify my thoughts on the realtime read. I definitely do not want to read the commit log; that would be horribly inefficient, since it's just a log. Rather, I would insert an in-memory cache into the write path. The in-memory cache would be written to as well as the commit log. When a query is executed on a node, it would query Lucene (the shard) as well as search the node's local cache for matching documents. These are then merged into a single result set via whatever ordering parameter is set. When the documents are flushed to Lucene, they would be removed from the cache.

I'm sure I'm not the first ES user to conceive of this idea. However, ES doesn't have this functionality, and I'm assuming there's a reason for it. Is it simply that no one has tried this use case, or is it not a good idea technically? Would it introduce too much memory overhead, etc.?

Thanks,
Todd

On Wed, Jun 4, 2014 at 11:05 PM, Mark Walkom wrote:

> I haven't heard of a limit to the number of indexes; obviously the more you have, the larger the cluster state that needs to be maintained.
>
> You might want to look into routing (http://exploringelasticsearch.com/advanced_techniques.html or http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-routing-field.html) as an alternative to optimise and minimise index count.
>
> You can also always hedge your bets and create an index with a larger number of shards, i.e. not a 1:1 shard:node relationship, and then move the excess shards to new nodes as they are added.
> I'd be interested to see how you could measure how you'd outgrow an index though; technically it can just keep growing until the node can no longer deal with it. This is something that testing is good for: throw data at a single-shard index, and when it falls over you have an indicator of how your hardware will handle things.
>
> As for reading the transaction log and searching it, you might be playing a losing game, as your code to parse and search would have to be super quick to make it worth doing.
>
> Regards,
> Mark Walkom
>
> Infrastructure Engineer
> Campaign Monitor
> email: ma...@campaignmonitor.com
> web: www.campaignmonitor.com
>
> On 5 June 2014 15:33, Todd Nine wrote:
>
>> Thanks for the answers Mark. See inline.
>>
>> On Wed, Jun 4, 2014 at 3:51 PM, Mark Walkom wrote:
>>
>>> 1) The answer is - it depends. You want to set up a test system with indicative specs, and then throw some sample data at it until things start to break. However this may help: https://www.found.no/foundation/sizing-elasticsearch/
>>
>> This is what I was expecting. Thanks for the pointer to the documentation. We're going to have some pretty beefy clusters (SSDs RAID 0, 8 to 16 cores, and a lot of RAM) to power ES. We're going to have a LOT of indexes; we would be operating this as a core infrastructure service. Is there an upper limit on the number of indexes a cluster can hold?
>>
>>> 2) https://github.com/jprante/elasticsearch-knapsack might do what you want.
>>
>> This won't quite work for us. We can't have any downtime, so it seems like an A/B system is more appropriate. What we're currently thinking is the following.
>>
>> Each index has 2 aliases: a read alias and a write alias.
>>
>> 1) Both read and write aliases point to an initial index, say shard count 5, replication 2 (ES is not our canonical data source, so we're OK with reconstructing search data).
>>
>> 2) We detect via monitoring that we're going to outgrow an index. We create a new index with more shards, and potentially a higher replication depending on read load. We then update the write alias to point to both the old and new index. All clients will then begin dual writes to both indexes.
>>
>> 3) While we're writing to old and new, some process (maybe a river?) will begin copying documents updated before the write alias change from the old index to the new index. Ideally, it would be nice if each replica could copy only its local documents into the new index.
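Todd's proposed read path — query Lucene, query the node-local write cache, and merge by the sort key with the cached (newer) copy winning on a document-ID collision — can be sketched in a few lines. Everything here (field names, the shape of a hit) is a hypothetical illustration of the merge logic, not ES internals:

```python
import heapq

def merged_search(lucene_hits, cache_hits, key=lambda h: h["sort"]):
    """Merge two hit lists ordered by `key`; the unflushed cached copy
    of a document shadows any stale flushed copy with the same id."""
    seen_ids = set()
    merged = []
    # Listing cache_hits first makes heapq.merge (which breaks ties in
    # favor of earlier iterables) yield the cached copy before the
    # flushed one when both sort equal.
    for hit in heapq.merge(sorted(cache_hits, key=key),
                           sorted(lucene_hits, key=key), key=key):
        if hit["id"] not in seen_ids:
            seen_ids.add(hit["id"])
            merged.append(hit)
    return merged

lucene = [{"id": 1, "sort": 10, "v": "flushed"}, {"id": 3, "sort": 30, "v": "flushed"}]
cache = [{"id": 1, "sort": 10, "v": "cached"}, {"id": 2, "sort": 20, "v": "cached"}]
print(merged_search(lucene, cache))
```

The memory-overhead question Todd raises is the real cost of this design: the cache must hold every unflushed document per shard until a refresh or flush evicts it.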
Re: Shard count and plugin questions
Thanks for the answers Mark. See inline.

On Wed, Jun 4, 2014 at 3:51 PM, Mark Walkom wrote:

> 1) The answer is - it depends. You want to set up a test system with indicative specs, and then throw some sample data at it until things start to break. However this may help: https://www.found.no/foundation/sizing-elasticsearch/

This is what I was expecting. Thanks for the pointer to the documentation. We're going to have some pretty beefy clusters (SSDs RAID 0, 8 to 16 cores, and a lot of RAM) to power ES. We're going to have a LOT of indexes; we would be operating this as a core infrastructure service. Is there an upper limit on the number of indexes a cluster can hold?

> 2) https://github.com/jprante/elasticsearch-knapsack might do what you want.

This won't quite work for us. We can't have any downtime, so it seems like an A/B system is more appropriate. What we're currently thinking is the following.

Each index has 2 aliases: a read alias and a write alias.

1) Both read and write aliases point to an initial index, say shard count 5, replication 2 (ES is not our canonical data source, so we're OK with reconstructing search data).

2) We detect via monitoring that we're going to outgrow an index. We create a new index with more shards, and potentially a higher replication depending on read load. We then update the write alias to point to both the old and new index. All clients will then begin dual writes to both indexes.

3) While we're writing to old and new, some process (maybe a river?) will begin copying documents updated before the write alias change from the old index to the new index. Ideally, it would be nice if each replica could copy only its local documents into the new index. We'll want to throttle this as well. Each node will need additional operational capacity to accommodate the dual writes as well as accepting the writes of the "old" documents. I'm concerned that if we push this through too fast, we could cause interruptions of service.
4) Once the copy is completed, the read alias is moved to the new index, then the old index is removed from the system.

Could such a process be implemented as a plugin? If the work can happen in parallel across all nodes containing a shard, we can increase the process's speed dramatically. If we have a single worker, like a river, it might take too long.

> 3) How real time is real time? You can change index.refresh_interval to something small so that the window of "unflushed" items is minimal, but that will have other impacts.

Once the index call returns to the caller, it would be immediately available for query. We've tried lowering the refresh interval; this results in a pretty significant drop in throughput. To meet our throughput requirements, we're considering even turning it up to 5 or 15 seconds. If we could then search the data that's in our commit log (via storing it in memory until flush), that would be ideal. Thoughts?

> Regards,
> Mark Walkom
>
> Infrastructure Engineer
> Campaign Monitor
> email: ma...@campaignmonitor.com
> web: www.campaignmonitor.com
>
> On 5 June 2014 04:18, Todd Nine wrote:
>
>> Hi All,
>> We've been using Elasticsearch as our search index for our new persistence implementation.
>>
>> https://usergrid.incubator.apache.org/
>>
>> I have a few questions I could use a hand with.
>>
>> 1) Is there any good documentation on the upper limit to count of documents, or total index size, before you need to allocate more shards? Do shards have a real-world limit on size or number of entries to keep response times low? Every system has its limits, and I'm trying to find some actual data on the size limits. I've been trawling Google for some answers, but I haven't really found any good test results.
>>
>> 2) Currently, it's not possible to increase the shard count for an index. The workaround is to create a new index with a higher count, and move documents from the old index into the new. Could this be accomplished via a plugin?
>>
>> 3) We sometimes have "realtime" requirements, in that when an index call is returned, it is available. Flushing explicitly is not a good idea from a performance perspective. Has anyone explored searching in memory the documents that have not yet been flushed and merging them with the Lucene results? Is this something that's feasible to implement via a plugin?
>>
>> Thanks in advance!
>> Todd
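The knob Mark refers to is a per-index dynamic setting. A sketch of the settings body Todd's team would be tuning (the index name is a placeholder; `index.refresh_interval` itself is the real setting):

```python
import json

# Relaxing the near-real-time refresh trades search visibility for
# indexing throughput: with "15s", newly indexed documents can take up
# to 15 seconds to become searchable. A value of "-1" disables refresh
# entirely (useful during a bulk backfill, restoring it afterwards).
settings_body = {"index": {"refresh_interval": "15s"}}

# Would be sent as: PUT /my_index/_settings   (index name hypothetical)
print(json.dumps(settings_body))
```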
Re: cross data center replication
Hey all,
Sorry to resurrect a dead thread. Did you ever find a solution for eventual consistency of documents across EC2 regions?

Thanks,
Todd

On Wednesday, May 1, 2013 5:50:00 AM UTC-7, Norberto Meijome wrote:

> +1 on all of the above. es-reindex is already on my list of things to investigate (for a number of issues...)
>
> cheers,
> b
>
> On Wed, May 1, 2013 at 6:58 AM, Paul Hill wrote:
>
>> On 4/23/2013 8:44 AM, Daniel Maher wrote:
>>
>>> On 2013-04-23 5:22 PM, Saikat Kanjilal wrote: Hello Folks, [...] does ES out of the box currently support cross data center replication, []
>>>
>>> Hello,
>>>
>>> I'd wager that the question you're really asking about is how to control where shards are placed; if you can make deterministic statements about where shards are, then you can create your own "rack-aware" or "data centre-aware" scenarios. ES has supported this "out of the box" for well over a year now (possibly longer).
>>>
>>> You'll want to investigate "zones" and "routing allocation", which are the key elements of shard placement. There is an excellent blog post which describes exactly how to set things up here: http://blog.sematext.com/2012/05/29/elasticsearch-shard-placement-control/
>>
>> Is shard allocation really the correct solution if the data centers are globally distributed?
>>
>> If I have a data center in the US intended to serve data from the US, but it should also have access to Europe and Asia data, and clusters in both Europe and Asia with similar needs, would I really want to use zones etc. and have one great global cluster with data-center-aware configurations?
>> Assuming that the US would be happy to deal with old documents from Asia and Europe when Asia or Europe is offline or just not caught up, it would seem that you would NOT want a "world" cluster, because I can't picture how you'd configure a 3-part world cluster to both index into the right indices and search the right (possible combination of) shards, while also preventing "split brain".
>>
>> In the scenario I've described, I would think each data center might better provide availability and eventual consistency (with less concern for the remote data from the other regions) by having three clusters and some type of syncing from one index to copies at the other two locations. For example, the US data center might have a US, copyOfEurope, and copyOfAsia index.
>>
>> Anyone have any observations about such a world-wide scenario?
>> Are there any index-to-index copy utilities?
>> Is there a river or other plugin that might be useful for this three-clusters-working-together scenario?
>> How about the project https://github.com/karussell/elasticsearch-reindex? Comments?
>>
>> -Paul
>
> --
> Norberto 'Beto' Meijome
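For the single-cluster, zone-aware approach Daniel describes, shard placement is driven by a node attribute plus cluster-level awareness settings. A sketch of the pieces (the attribute name `zone` and the zone values are illustrative; the setting keys are the standard allocation-awareness ones):

```python
import json

# Each node declares its location in elasticsearch.yml, e.g.:
#   node.zone: us-east     (attribute name is arbitrary; "zone" is common)

# The cluster is then told to balance copies of each shard across that
# attribute. With *forced* awareness, replicas are only allocated across
# the listed zones, so each zone keeps its own copy of the data.
awareness_settings = {
    "persistent": {
        "cluster.routing.allocation.awareness.attributes": "zone",
        "cluster.routing.allocation.awareness.force.zone.values": "us-east,eu-west",
    }
}

# Would be sent as: PUT /_cluster/settings
print(json.dumps(awareness_settings))
```

As Paul points out, this still means one cluster spanning regions, so it does nothing about WAN latency or split-brain between data centers; the three-independent-clusters-plus-sync model trades that tight coupling for isolation and eventual consistency.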
Shard count and plugin questions
Hi All,
We've been using Elasticsearch as our search index for our new persistence implementation.

https://usergrid.incubator.apache.org/

I have a few questions I could use a hand with.

1) Is there any good documentation on the upper limit to count of documents, or total index size, before you need to allocate more shards? Do shards have a real-world limit on size or number of entries to keep response times low? Every system has its limits, and I'm trying to find some actual data on the size limits. I've been trawling Google for some answers, but I haven't really found any good test results.

2) Currently, it's not possible to increase the shard count for an index. The workaround is to create a new index with a higher count and move documents from the old index into the new. Could this be accomplished via a plugin?

3) We sometimes have "realtime" requirements, in that when an index call is returned, it is available. Flushing explicitly is not a good idea from a performance perspective. Has anyone explored searching in memory the documents that have not yet been flushed and merging them with the Lucene results? Is this something that's feasible to implement via a plugin?

Thanks in advance!
Todd