Sum aggregation with results from other aggregations?
Is it possible to create an aggregation where I can do a sum on the results of a sub-bucket? I'm working on Twitter data. In this data I have a bunch of retweets of different users. Say that user A has 10 tweets that are retweeted a hundred times in my dataset. I want to find the maximum retweet_count for each individual tweet, and then I want to sum all of those maximums for an individual user. This is the base query structure I'm working with:

    {
      "aggs": {
        "user_id": {
          "terms": { "field": "retweet_user_id" },
          "aggs": {
            "tweet_ids": {
              "terms": {
                "field": "retweet_id",
                "order": { "max_tweet.value": "desc" }
              },
              "aggs": {
                "max_tweet": {
                  "max": { "field": "retweet_count" }
                }
              }
            }
          }
        }
      }
    }

Importantly, I don't want to just take a sum of "retweet_count" for a given retweet_user_id - that doesn't give the max value per tweet. Essentially, is it possible for me to take a sum of the agg results at user_id.tweet_ids.max_tweet.value, and use that sum as an "order" term in the user_id terms agg?

--
Please update your bookmarks! We have moved to https://discuss.elastic.co/
---
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
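For what it's worth, later Elasticsearch releases (2.0+) added pipeline aggregations that compute exactly this kind of "sum of sub-bucket maximums". A sketch of the request body, built as a Python dict - the `sum_bucket` agg and `buckets_path` syntax are from the 2.x pipeline-aggregation feature, not the version discussed in this thread, and ordering the `user_id` terms by the pipeline result may still not be supported, so this only returns the per-user sum:

```python
# Sketch (assumes Elasticsearch 2.0+ pipeline aggregations; the field
# names are taken from the post above). sum_of_maxes sums the per-tweet
# max_tweet values across the tweet_ids buckets of each user bucket.
body = {
    "size": 0,
    "aggs": {
        "user_id": {
            "terms": {"field": "retweet_user_id"},
            "aggs": {
                "tweet_ids": {
                    "terms": {"field": "retweet_id"},
                    "aggs": {
                        "max_tweet": {"max": {"field": "retweet_count"}}
                    },
                },
                # pipeline agg: sum max_tweet across the tweet_ids buckets
                "sum_of_maxes": {
                    "sum_bucket": {"buckets_path": "tweet_ids>max_tweet"}
                },
            },
        }
    },
}
```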
Elasticsearch twitter river filtered stream question
Quick question about the ES twitter river at https://github.com/elasticsearch/elasticsearch-river-twitter

The Twitter streaming API allows you to filter, and you apparently get up to 1% of the total stream matching your search queries. So, if I were filtering for "coffee", I'd get "coffee" tweets that I wouldn't get if I were just capturing the 1% stream passively. Does the Twitter river use this filter functionality, or does it do its filtering on the ingestion side, ingesting the normal 1% stream and discarding anything that doesn't match?
Re: Automatic index balancing plugin or other solution?
Interesting, ok. A colleague went to Elasticsearch training and was told that, given a default index with N shards, similar index size was critical for maintaining consistent search performance. I guess that could play out via a two billion record index having a huge number of unique terms, while a smaller, say, 100k record index would have a substantially smaller set of terms, right? Dealing with content from sources like the Twitter public API, I would anticipate fairly linear growth of unique terms and overall index size. This ultimately results in the scenario I described initially, where a larger index is comparatively slower to search due to its necessarily larger dictionary. It seems as though there'd still be room for the kind of automatic scaling via a template system described above?

On Wednesday, April 9, 2014 7:38:35 AM UTC-7, Jörg Prante wrote:
>
> The number of documents is not relevant to the search time.
>
> Important factors for search time are the type of query, shard size, the
> number of unique terms (the dictionary size), the number of segments,
> network latency, disk drive latency, ...
>
> Maybe you mean equal distribution of docs with same average size across
> shards. This means a search does not have to wait for nodes that must
> search in larger shards.
>
> I do not think this needs a river plugin, since equal distribution of docs
> over the shards is the default.
>
> Jörg
>
> On Tue, Apr 8, 2014 at 9:03 PM, Josh Harrison wrote:
>
>> I have heard that ideally, you want to have a similar number of documents
>> per shard for optimal search times, is that correct?
>>
>> I have data volumes that are just all over the place, from 100k to tens
>> of millions in a week.
>>
>> I'm thinking about a river plugin that could:
>> Take a mapping object as a template
>> Define a template for child index names (project_YYYY_MM_DD_NNN = project_2014_04_08_000, etc.)
>> Define index shard count (5)
>> Define maximum index size (1,000,000)
>> Define a listening endpoint of some sort
>>
>> Documents would stream into the listening endpoint however you wanted -
>> rivers, bulk loads using an API, etc. They would be automatically routed to
>> the lowest numbered not-full index. So on a given day you could end up with
>> fifteen indexes, or eighty, or two, but they'd all be a maximum of N
>> records.
>>
>> A plugin seems desirable in this case, as it frees you from needing to
>> write the load balancing into every ingestion stream you've got.
>>
>> Is this a reasonable solution to this problem? Am I overcomplicating
>> things?
Automatic index balancing plugin or other solution?
I have heard that ideally, you want to have a similar number of documents per shard for optimal search times - is that correct? I have data volumes that are just all over the place, from 100k to tens of millions in a week.

I'm thinking about a river plugin that could:
Take a mapping object as a template
Define a template for child index names (project_YYYY_MM_DD_NNN = project_2014_04_08_000, etc.)
Define index shard count (5)
Define maximum index size (1,000,000)
Define a listening endpoint of some sort

Documents would stream into the listening endpoint however you wanted - rivers, bulk loads using an API, etc. They would be automatically routed to the lowest numbered not-full index. So on a given day you could end up with fifteen indexes, or eighty, or two, but they'd all be a maximum of N records.

A plugin seems desirable in this case, as it frees you from needing to write the load balancing into every ingestion stream you've got.

Is this a reasonable solution to this problem? Am I overcomplicating things?
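The routing rule described above can be sketched in plain Python (this is not a real river plugin - the function name and signature are mine; the naming pattern and the 1,000,000 cap come from the post):

```python
# Sketch of the proposed routing rule: send each document to the
# lowest-numbered child index for the day that hasn't hit the maximum
# record count, using the project_YYYY_MM_DD_NNN naming pattern.
import datetime

def pick_index(doc_counts, day, max_docs=1000000):
    """doc_counts maps index name -> current document count."""
    n = 0
    while True:
        name = "project_%s_%03d" % (day.strftime("%Y_%m_%d"), n)
        if doc_counts.get(name, 0) < max_docs:
            return name
        n += 1
```

A real implementation would also need to refresh `doc_counts` from the cluster (or track it internally) as documents stream in.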
Re: Please explain the flow of data?
Awesome, ok, thank you. Is the logic behind not allowing storage on master nodes twofold: to take advantage of a system with limited storage resources, and to have a dedicated results aggregator/search handler? I can imagine that with a particularly badly written, gnarly search, handling the results on a master while that master also fields queries could be bad. So in a 16 node cluster you'd want to have 9 nodes allowed to be masters, (n/2)+1?

Thanks again!
Josh

On Friday, March 21, 2014 3:20:24 PM UTC-7, Mark Walkom wrote:
>
> A couple of things;
>
> 1. You should have n/2+1 masters in your cluster, where n = number of
> nodes. This helps prevent split brain situations and is best practise.
> 2. Your master nodes can store data, this way you don't need to add
> more nodes to fulfil the above.
>
> Your indexing scenario is correct.
> For searching, replica's and primaries can be queried.
> For both - Adding more masters adds redundancy as per the first two
> points. Adding more search nodes won't do much though other than reduce the
> load on your masters (unless someone else can add anything I don't know :p).
>
> And for your final question, yes that is correct.
>
> To give you an idea of practical application, we don't use search nodes
> but have 3 non-data masters that handle all queries, and a bunch of data
> only nodes for storing everything.
>
> Regards,
> Mark Walkom
>
> Infrastructure Engineer
> Campaign Monitor
> email: ma...@campaignmonitor.com
> web: www.campaignmonitor.com
>
> On 22 March 2014 08:25, Josh Harrison wrote:
>
>> I'm trying to build a basic understanding of how indexing and searching
>> works, hopefully someone can either point me to good resources or explain!
>> I'm trying to figure out what having multiple "coordinator" nodes as
>> defined in the elasticsearch.yml would do, and what having multiple "search
>> load balancer" nodes would do. Both in the context of indexing and
>> searching.
>> Is there a functional difference between a "coordinator" node and a >> "search load balancer" node, beyond the fact that a "search load balancer" >> node can't be elected master? >> >> >> Say I have a 4 node cluster. There's a master only "coordinator" node, >> that doesn't store data, named "master". >> node.master: true >> node.data: false >> >> There are three data only nodes, "A", "B" and "C" >> node.master: false >> node.date: true >> >> I have an index "test" with two shards and one replica. Primary shard 0 >> lives on A, primary shard 1 lives on C, replica shard 0 lives on B, replica >> shard 1 lives on A. >> >> I send the command >> curl -XPOST http://master:9200/test/test -d '{"foo":"bar"}' >> >> A connection is made to master, and the data is sent to master to be >> indexed. Master randomly decides to place this document in shard 1, so it >> gets sent to the primary shard 1 on C and replica shard 1 on B, right? This >> is where routing can come in, I can say that that document really should go >> to shard 0 because I said so. >> >> So this is a fairly simple scenario, assuming I'm correct. >> >> What benefit do I get to indexing when I add more "coordinator" nodes? >> node.master: true >> node.data: false >> >> What about if I add "search load balancer" nodes? >> node.master: false >> node.data: false >> >> >> >> How about on the searching side of things? >> I send a search to master, >> curl -XPOST http://master:9200/test/test/_search -d >> '{"query":{"match_all":{}}}' >> >> Master sends these queries off to A, B and C, who each generate their own >> results and return them to master. Each data node queries all the relevant >> shards that are present locally and then combines those results for >> delivery to master. Do only primary shards get queried, or are replica >> shards queried too? >> Master takes these combined results from all the relevant nodes and >> combines them into the final query response. 
>> >> Same questions: >> What benefit do I get to searching when I add more nodes that are like >> master? >> node.master: true >> node.data: false >> >> What about if I add "search load balancer" nodes? >> node.master: false >> node.data: false >> >>
Please explain the flow of data?
I'm trying to build a basic understanding of how indexing and searching works - hopefully someone can either point me to good resources or explain! I'm trying to figure out what having multiple "coordinator" nodes as defined in elasticsearch.yml would do, and what having multiple "search load balancer" nodes would do, both in the context of indexing and searching. Is there a functional difference between a "coordinator" node and a "search load balancer" node, beyond the fact that a "search load balancer" node can't be elected master?

Say I have a 4 node cluster. There's a master-only "coordinator" node that doesn't store data, named "master":

    node.master: true
    node.data: false

There are three data-only nodes, "A", "B" and "C":

    node.master: false
    node.data: true

I have an index "test" with two shards and one replica. Primary shard 0 lives on A, primary shard 1 lives on C, replica shard 0 lives on B, replica shard 1 lives on A.

I send the command:

    curl -XPOST http://master:9200/test/test -d '{"foo":"bar"}'

A connection is made to master, and the data is sent to master to be indexed. Master randomly decides to place this document in shard 1, so it gets sent to the primary shard 1 on C and replica shard 1 on A, right? This is where routing can come in: I can say that that document really should go to shard 0, because I said so.

So this is a fairly simple scenario, assuming I'm correct. What benefit do I get to indexing when I add more "coordinator" nodes (node.master: true, node.data: false)? What about if I add "search load balancer" nodes (node.master: false, node.data: false)?

How about on the searching side of things? I send a search to master:

    curl -XPOST http://master:9200/test/test/_search -d '{"query":{"match_all":{}}}'

Master sends these queries off to A, B and C, who each generate their own results and return them to master. Each data node queries all the relevant shards that are present locally and then combines those results for delivery to master. Do only primary shards get queried, or are replica shards queried too? Master takes these combined results from all the relevant nodes and combines them into the final query response.

Same questions: what benefit do I get to searching when I add more nodes that are like master (node.master: true, node.data: false)? What about if I add "search load balancer" nodes (node.master: false, node.data: false)? Is the only difference between a node.master: true / node.data: false node and a node.master: false / node.data: false node that the former is a candidate to become master, should it be elected?
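To make the scatter/gather concrete, here is a toy model in plain Python - no Elasticsearch APIs, and shard selection and scoring are grossly simplified. It also illustrates the answer to the replica question: any copy of a shard, primary or replica, can serve a read:

```python
# Toy model of distributed search: the coordinating node asks one copy of
# each shard for its top-k hits, then merges the partial results into a
# global top-k. Illustration only, not how ES is implemented internally.
import heapq
import random

def search_shard(shard_docs, k):
    """Each shard returns its own top-k (highest score first)."""
    return sorted(shard_docs, key=lambda d: d["score"], reverse=True)[:k]

def coordinate(shards, k):
    """Pick one copy per shard (primary or any replica - both serve
    reads) and merge the sorted partial results."""
    partials = []
    for copies in shards:                 # copies = [primary, replica, ...]
        chosen = random.choice(copies)    # replicas share the read load
        partials.append(search_shard(chosen, k))
    merged = heapq.merge(*partials, key=lambda d: -d["score"])
    return list(merged)[:k]
```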
Re: elasticsearch-py usage
It doesn't look like the elasticsearch-py API covers the river use case. When I've run into things like this I've always just run a manual curl request, or if I need to do it from within a script I do a basic call with requests, a la:

    requests.put("http://localhost:9200/_river/mydocs/_meta",
                 data='{"type": "fs", "fs": {"url": "/tmp", "update_rate": 90, "includes": "*.doc,*.pdf", "excludes": "resume"}}')

Not the most elegant approach, but it works!

On Thursday, March 13, 2014 1:57:55 PM UTC-7, Kent Tenney wrote:
>
> From the fsriver doc:
>
> curl -XPUT 'localhost:9200/_river/mydocs/_meta' -d '{
>   "type": "fs",
>   "fs": {
>     "url": "/tmp",
>     "update_rate": 90,
>     "includes": "*.doc,*.pdf",
>     "excludes": "resume"
>   }
> }'
>
> How does this translate to the Python API?
>
> Thanks,
> Kent
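Since the river `_meta` document is just a normal document under the `_river` index, elasticsearch-py's generic `index()` call can register it too. A sketch, assuming elasticsearch-py 1.x and a node on localhost (the network call is commented out so nothing runs against a live cluster here):

```python
# The fsriver _meta document from the post, as a plain dict.
meta = {
    "type": "fs",
    "fs": {
        "url": "/tmp",
        "update_rate": 90,
        "includes": "*.doc,*.pdf",
        "excludes": "resume",
    },
}

# Registering it via the generic document-index API (elasticsearch-py 1.x):
# from elasticsearch import Elasticsearch
# es = Elasticsearch("localhost:9200")
# es.index(index="_river", doc_type="mydocs", id="_meta", body=meta)
```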
Re: Best way to duplicate data across clusters live?
Kafka looks interesting, though at this point we're actively trying to reduce the number of moving parts, so I think an AMQ-based approach is what we'll ultimately go for. Seems like there might be room here for an elasticsearch-elasticsearch-river plugin or something - to do one- or two-way, close-to-real-time replication of some selected set of indexes between separate clusters. That way you could easily mirror prod data to a dev environment without depending on the ability to do the duplication earlier in the pipeline, or depending on scripts to move the data around.

On Wednesday, March 12, 2014 2:46:19 PM UTC-7, Otis Gospodnetic wrote:
>
> Consider Kafka 0.8.1. It comes with a MirrorMaker tool that mirrors Kafka
> data (to multiple DCs). Once data is local, you can feed your ES from the
> local Kafka broker.
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
> On Wednesday, March 12, 2014 2:55:58 PM UTC-4, Josh Harrison wrote:
>>
>> Say I have clusters A and B. Cluster A is consuming data using an
>> ActiveMQ river. I would like to stream data to cluster B as well. Do I just
>> create a secondary outbound AMQ channel and subscribe cluster B to it, or
>> is there a decent way to have a live copy of data going two places at once?
Best way to duplicate data across clusters live?
Say I have clusters A and B. Cluster A is consuming data using an ActiveMQ river. I would like to stream data to cluster B as well. Do I just create a secondary outbound AMQ channel and subscribe cluster B to it, or is there a decent way to have a live copy of data going two places at once?
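The "secondary outbound channel" option boils down to fanning each message out to two queues at the producer. A minimal sketch of that shape, with stub publisher callables standing in for real AMQ clients (all names here are illustrative, not a real messaging API):

```python
# Fan-out at the producer: every message is handed to a publisher for
# each downstream queue (cluster A's river queue and cluster B's).
def fanout(message, publishers):
    """Send the same message to every downstream queue."""
    for publish in publishers:
        publish(message)

# Example with stubs standing in for the two clusters' queue clients:
sent_a, sent_b = [], []
fanout('{"tweet": "hello"}', [sent_a.append, sent_b.append])
```

The trade-off against replicating earlier in the pipeline is that the producer now owns delivery to both clusters, including retry/buffering when one of them is down.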
Too many nodes started up on some data nodes - best approach to fix?
I restarted my cluster the other day, but something odd stuck, resulting in 15 of 16 data nodes starting up an extra ES instance in the same cluster. This ended badly: there were two nodes with identical display names, the system locked up, etc. When restarting again, to my horror, we were missing shards. I quickly figured out that the missing shards had been moved into the second instance's storage location. What is the best way to resolve this? Should we spawn second ES instances on the culprit machines (with different instance names), or can a simple

    mv escluster/nodes/1/indices/data1/* escluster/nodes/0/indices/data1/

do the job? Thanks!
Re: Random scan results?
Darn, ok. Thank you. If I'm retrieving large numbers of random, largish documents (Twitter river records), is there a particular pattern I should use for searching? That is, does it make sense to send 20 sequential queries with size 10,000 and random sorting, or a single query with a size of 200,000? What about up into the millions? Obviously we're risking duplication of results when sending multiple smaller queries, but this is OK for our purposes, or can be dealt with at another stage of the process outside ES.

Thanks,
Josh

On Wednesday, February 19, 2014 12:41:58 PM UTC-8, Adrien Grand wrote:
>
> Hi Josh,
>
> In order to run efficiently, scan queries read records sequentially on
> disk and keep a cursor that is used to maintain state between successive
> pages. It would not be possible to get records in a random order as it
> would not be possible to read sequentially anymore.
>
> On Wed, Feb 19, 2014 at 9:04 PM, Josh Harrison wrote:
>
>> I need to be able to pull 100s of thousands to millions of random
>> documents from my indexes. Normally, to pull data this large I'd do a scan
>> query, but they don't support sorting, so the suggestions I've seen online
>> for randomizing your results don't work (such as those discussed here:
>> http://stackoverflow.com/questions/9796470/random-order-pagination-elasticsearch).
>> Is there a way to introduce randomness into a basic scan query?
>
> --
> Adrien Grand
Random scan results?
I need to be able to pull 100s of thousands to millions of random documents from my indexes. Normally, to pull data this large I'd do a scan query, but they don't support sorting, so the suggestions I've seen online for randomizing your results don't work (such as those discussed here: http://stackoverflow.com/questions/9796470/random-order-pagination-elasticsearch). Is there a way to introduce randomness into a basic scan query?
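One workaround from that Stack Overflow thread is a `function_score` query with `random_score`. As the reply above explains, this only helps for regular (non-scan) search and pagination, since scan ignores sorting. A sketch of the request body, built as a Python dict (the seed value is arbitrary; a fixed seed keeps the ordering stable across pages within one pass):

```python
# Randomly ordered hits via function_score/random_score (normal search,
# not scan): match_all wrapped so each document gets a random score.
body = {
    "query": {
        "function_score": {
            "query": {"match_all": {}},
            "random_score": {"seed": 42},
        }
    }
}
```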
Documents per shard
I've got indexes storing the same kind of data split into weekly chunks, and there has been some fairly substantial variation in data volume. I've got a mapping change I need to make across all the back data, and I'm thinking it might make sense to rebalance the documents per shard so that I have around 1 shard per N documents. Is that a worthwhile time investment in terms of query performance, or should I just stick with the 3 shards per index I've been using so far? I'd keep 3 shards as a minimum, so if there's a week with 10 documents it would still have 3 shards. If I have an index that would end up with more than one shard per data node, does it make more sense to limit the number of shards to the number of data nodes, or to go ahead and follow the 1 shard per N documents pattern? Thanks! Josh
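The sizing rule being proposed - 1 shard per N documents, a 3-shard floor, and an optional cap at the data-node count - can be written down as a quick sketch (the function name and defaults are mine, not from any ES API):

```python
# Shard-count rule from the post: ceil(docs / N), at least 3, optionally
# capped at the number of data nodes.
import math

def shard_count(doc_count, docs_per_shard, min_shards=3, max_shards=None):
    n = max(min_shards, int(math.ceil(doc_count / float(docs_per_shard))))
    if max_shards is not None:
        n = min(n, max(min_shards, max_shards))
    return n
```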
Re: Data Loss
This particular cluster is 16 data nodes with SSD RAIDs, connected to each other and the two master nodes with InfiniBand. Under 100 indexes and usually 3 shards per index with 1 replica. Overall data volume is in the 1TB range. I haven't tweaked the shard allocation settings from default. -Josh

On Wednesday, February 12, 2014 1:36:53 PM UTC-8, Tony Su wrote:
>
> Josh,
>
> Your experience about recovering in only about 10 minutes is very
> interesting. Because my little 5-node cluster/15GB data/3500 indices is
> taking about an hour to recover, and I know the bottleneck is the disk
> subsystem I'm currently on.
>
> Am curious:
> - What is the total size of data in your cluster?
> - How many indices?
> - Are the shard numbers pretty typical (5 shards per index, 1 replica for
> every shard)?
> - Are you storing your data on a SAN, SCSI array or something else, and
> are the disks SSD?
>
> Thx,
> Tony
Re: Data Loss
I'm sure it isn't the case for everyone that is having data/shard problems, but I had some real trouble doing a full cluster restart on an 18 node cluster. Kinda nightmarish, actually, shards failing all over the place, lost data because of lost shards, etc. I finally realized that the gateway.recover_after_nodes, gateway.expected_nodes and gateway.recover_after_time config properties were critical to avoiding my situation. Before the gateway configuration stuff was in there, it would take literally hours and a lot of work to get everything back to green. We dreaded a full cluster restart. After the gateway configuration stuff, a full cluster restart, from service restart on all systems to full green, takes anywhere from 2-10 minutes total. The root cause in my situation was a few nodes coming up in the cluster, and seeing a severely degraded state and trying to "fix" everything, resulting in chaos as more nodes came up. Hopefully this is helpful to someone! http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-gateway.html -Josh On Wednesday, February 12, 2014 11:49:29 AM UTC-8, Tony Su wrote: > > IMO evaluating this issue starts with applying the CAP Theorem which in > summary states that networked clusters with multiple nodes can offer only 2 > of the following 3 desirable objectives > > Consistency > Availability > Partition tolerance (data distributed across nodes). > > ES clearly does the last two so in theory cannot guarantee the first. > Of course "guarantee" is not the same as "best effort" which as expected > is being done. > And, this Theorem applies to multi-node cluster technologies of > which ES is one. > > Tony > > > > On Wednesday, February 12, 2014 8:09:58 AM UTC-8, Brad Lhotsky wrote: > >> Appreciated, but keep in mind large installations can’t just constantly >> upgrade. And if ES is being used in critical infrastructure upgrading may >> mean many hours of recertification work with auditors and assessors. 
>> The project is still relatively young, but "just upgrade" isn't always
>> plausible. It takes over 2 hours for a cluster to go back to green when a
>> single node restarts for my logging cluster. I have 15 nodes now, which
>> means a safe upgrade path may take literally 1 working week. That assumes
>> I can have nodes with different versions in the cluster. Or I have to lose
>> data while I restart the *whole* cluster - and a whole cluster restart is
>> also ~4 hours.
>>
>> --
>> Brad Lhotsky
>>
>> On 12 Feb 2014 at 16:07:20, Binh Ly (bi...@hibalo.com) wrote:
>>
>> FYI, ES has very frequent releases to fix bugs discovered by the
>> community. If you find a data loss problem in your current install (and
>> assuming it is indeed an ES problem), please try the latest build and see
>> if it fixes it. Chances are it has already been discovered and fixed in the
>> latest release.
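For reference, the gateway settings mentioned above live in elasticsearch.yml. A sketch with illustrative values for a 16-node cluster - the numbers are assumptions, not taken from the post, and should be tuned to your own topology:

```yaml
# Hold off recovery until enough nodes have joined, instead of letting
# early joiners "fix" a severely degraded view of the cluster.
gateway.recover_after_nodes: 13
gateway.expected_nodes: 16
gateway.recover_after_time: 5m
```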
Re: Plugin development guidance
Great, thanks Jörg! I'll start fiddling around with the langdetect plugin to see if I can get it going with our library.

On Tue, Feb 11, 2014 at 1:18 PM, joergpra...@gmail.com <joergpra...@gmail.com> wrote:
> An analyzer plugin is the right thing. Adding the recognized/extracted
> terms needs access to the ES mapping service. There are a few plugins out
> there which work in this manner, for example, the attachment mapper plugin.
>
> Or the lang-detect plugin; it adds the recognized language(s) as a keyword
> code into a neighbor field for filtering or faceting:
> https://github.com/jprante/elasticsearch-langdetect
>
> Also, I developed a similar plugin that works with recognition techniques;
> it can recognize ISBNs or other standard numbers in a text, and injects
> extra tokens into the token stream to identify these numbers:
> https://github.com/jprante/elasticsearch-analysis-standardnumber
>
> Jörg
Plugin development guidance
Hi all, we've got an internal Java library that does keyword extraction, and it seems like a great thing to turn into an integrated elasticsearch function. Ultimately, I want to be able to access the result of this library from search results etc., but I wanted to do a sanity check to make sure my approach is right - or whether I should be looking at doing a custom analyzer or something instead.

Given a string field, the type would become a multi-field, with {name} and keywords/phrases as subfields. A plugin would be written to handle this keywords field, run the strings through the library, and return a list of strings like:

    "my_data": "Jack and Jill went up the hill, Jack fell down and bumped his crown, and Jill came tumbling after."
    "my_data.keywords": ["Jack", "Jack fell"]

That's a trivial example, of course, and the algorithm is more complex than standard stopword filtering. Ultimately, I want to be able to expose the my_data.keywords field as an actual list like the above, so that we can use it in other things like facets down the line. So is a custom type plugin the right way to go here, or should I be looking at developing a more complex analyzer/tokenizer/stopword combo?
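A sketch of what the multi-field mapping could look like, built as a Python dict (ES 1.x multi-field "fields" syntax; the `keyword_extractor` analyzer name is hypothetical - it would be provided by the custom analysis plugin being discussed):

```python
# Multi-field mapping sketch: my_data keeps its normal analysis, while
# my_data.keywords is produced by a (hypothetical) plugin analyzer.
mapping = {
    "properties": {
        "my_data": {
            "type": "string",
            "fields": {
                "keywords": {
                    "type": "string",
                    "analyzer": "keyword_extractor",  # hypothetical plugin analyzer
                }
            },
        }
    }
}
```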
Re: Stress testing queries
I think a basic tool that is relatively independent of cluster specifics could be pretty useful. I'm imagining a tool that allows you to do load testing against any cluster you point it at:

Test indexing by selecting the complexity of the data objects you're interested in - i.e., create N test indices with X shards and Y replicas each, and send either a custom object with fields whose length and type of random variables can be defined, or basic objects at various sizes (an example tweet, log record, simple data point, etc.). As a better example, if I wanted to see how my system did with objects that looked like {"test1":18, "test2":[{"test3":"XnksdjfknSeorjOJosdimkn skjnfds sdfsidfun ifsdfosdmfo"}, {"test3":"dslkmlkdsfnUFIDNSiufndsfn DFISnu"}]}, we could just pass something like a mapping: {"test1":{"type":"int", "maxSize":1000, "minSize":-1029420}, "test2":{"type":"nested", "minSize":0, "maxSize":20, "properties":{"test3":{"type":"string", "maxSize":160, "minSize":20}}}} Random objects would be generated according to the specifications of the template. Maybe you could also pick a type - "english", "french", "russian", etc. - to generate strings that are actual language, based on a dictionary of terms (or be able to define custom "types" pointing at text files).

You could then indicate how many objects you want created in any given index (also a min/max range), the number of workers writing to any given index, the total number of workers you want writing data, the total number of indexes, etc. Data could all be sent over the native API - I'm a Python guy, so I'll say Python - or over HTTP using something like the requests module. This could allow for interesting comparisons of the APIs.

Test queries by doing pretty much the same thing. Track the query response times per worker, and other relevant stats. Maybe have a definable "max requests per second per worker", so you can replicate your worst case user behavior in each process.
This would be step 1 of the process; step 2 would be developing something so that a central test system could allocate testing jobs and collect stats across a number of client test systems. So I'd set up a test service on host A, and test clients on hosts B and C. B and C would be sent job properties by A; B and C would then launch, track their own stats, and send them to A to aggregate. Scale out to as many systems as you like. This is just a first pass at the idea - there may be some dumb mistakes in logic or oversights about test cases - but I think an app like this could be pretty useful. Heck, you could have a GUI on it, or just make it run off a yaml file or something. If it gets into my head enough maybe I'll try to write this up, though like I said, it'd be in Python since that's my language of choice. So it wouldn't be as optimal a testing platform as a native Java app, I guess, but still useful as a proof of concept. On Thursday, January 30, 2014 4:41:06 PM UTC-8, Josh Harrison wrote: > > In our case, we're just interested in query stress testing. We've got a > web app that queries our indexes that are organized based on weeks of the > year, with a bunch of aliases making it so specific portions of the data > can be reached easily. Questions about scaling the app have come up. In our > case, that means testing through the app itself, which so far only makes > queries. I figure we should load test our cluster directly too, so we can > see if there is a bottleneck somewhere in the app, if any eventual > bottlenecks are on the cluster itself. > > So far I haven't been able to really max out the indexing rate on a system > that is adequately equipped with resources, that I can tell. I've had 32 > sub-process Python workers happily sending, I think, ~5+ million records an > hour to our cluster with no problem in indexing speed or other response > time when backloading some data. 
> My current strategy is to get the ugliest heavy queries the application > runs and simply use ABS or something similar to run queries over http with > variables that are in a reasonable range. If I can make my cluster crash by > doing that, I know that'll be my upper limit! > > > > > On Thursday, January 30, 2014 3:59:19 PM UTC-8, Jörg Prante wrote: >> >> Just a few questions, because I'm also interested in load testing. >> >> What kind of stress do you think of? Random data? Wikipedia? Logfiles? >> Just query? What about indexing? And what client? Java? Other script >> languages? How should the c
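The template-driven generator proposed above can be sketched in Python. The spec format here (the `type`/`minSize`/`maxSize`/`properties` keys) is just the hypothetical one from the example mapping in the proposal, not any real ES or library API:

```python
import random
import string

def random_value(spec):
    """Generate one random value from a field spec (hypothetical format
    modeled on the template sketched in the proposal above)."""
    t = spec["type"]
    if t == "int":
        # Inclusive range, as in the example template.
        return random.randint(spec["minSize"], spec["maxSize"])
    if t == "string":
        # Random-length run of letters and spaces.
        length = random.randint(spec["minSize"], spec["maxSize"])
        return "".join(random.choice(string.ascii_letters + " ")
                       for _ in range(length))
    if t == "nested":
        # A list of sub-documents, sized between minSize and maxSize.
        count = random.randint(spec["minSize"], spec["maxSize"])
        return [generate_doc(spec["properties"]) for _ in range(count)]
    raise ValueError("unknown type: %s" % t)

def generate_doc(template):
    """Build one random document matching a {field: spec} template."""
    return {field: random_value(spec) for field, spec in template.items()}

template = {
    "test1": {"type": "int", "minSize": -1029420, "maxSize": 1000},
    "test2": {"type": "nested", "minSize": 0, "maxSize": 20,
              "properties": {"test3": {"type": "string",
                                       "minSize": 20, "maxSize": 160}}},
}
doc = generate_doc(template)
```

Each generated `doc` could then be bulk-indexed by a worker; the dictionary-of-terms idea for language-like strings would slot in as another `type` branch.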
Re: Stress testing queries
In our case, we're just interested in query stress testing. We've got a web app that queries our indexes, which are organized based on weeks of the year, with a bunch of aliases making it so specific portions of the data can be reached easily. Questions about scaling the app have come up. In our case, that means testing through the app itself, which so far only makes queries. I figure we should load test our cluster directly too, so we can see whether a bottleneck is somewhere in the app, or whether any eventual bottlenecks are on the cluster itself. So far I haven't been able to really max out the indexing rate on a system that is adequately equipped with resources, that I can tell. I've had 32 sub-process Python workers happily sending, I think, ~5+ million records an hour to our cluster with no problem in indexing speed or other response time when backloading some data. My current strategy is to get the ugliest heavy queries the application runs and simply use ABS or something similar to run queries over HTTP with variables that are in a reasonable range. If I can make my cluster crash by doing that, I know that'll be my upper limit! On Thursday, January 30, 2014 3:59:19 PM UTC-8, Jörg Prante wrote: > > Just a few questions, because I'm also interested in load testing. > > What kind of stress do you think of? Random data? Wikipedia? Logfiles? > Just query? What about indexing? And what client? Java? Other script > languages? How should the cluster be configured, one node? two or more > nodes? Index shards? Replica? etc. etc. > > There are so many variants and options out there, I believe this is one of > the reason why a compelling load testing tool is still missing. > > It would be nice to have a tool to upload ES performance profiles to a > public web site, for checking how well an ES cluster is tuned in comparison > to others. A measure unit for comparing performance is needed to be > defined, e.g. 
"this cluster performs with a power factor of 1.0, this > cluster has power factor 1.5, 2.0, ..." > > That's only possible when all software and hardware characteristics are > properly taken into account, plus "application profiles" for a typical > workloads, so it can be decided which configuration is best for what > purpose. > > Jörg > >
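The per-worker latency tracking and "max requests per second per worker" cap discussed in this thread can be sketched with the transport left pluggable, so `requests`, the native client, or the web app itself could be dropped in as the callable. All function names here are made up for illustration:

```python
import time
import threading

def run_worker(send_query, queries, max_rps=None):
    """Run each query through send_query(q), recording per-query latency.
    max_rps optionally caps this worker's request rate, to replicate
    worst-case user pacing."""
    latencies = []
    min_interval = 1.0 / max_rps if max_rps else 0.0
    for q in queries:
        start = time.time()
        send_query(q)
        elapsed = time.time() - start
        latencies.append(elapsed)
        # Sleep off the remainder of the interval to honor the rate cap.
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
    return latencies

def run_load_test(send_query, queries, workers=4):
    """Fan the same query list out to several threads, collecting
    every worker's latencies into one list."""
    results = []
    lock = threading.Lock()

    def target():
        stats = run_worker(send_query, queries)
        with lock:
            results.extend(stats)

    threads = [threading.Thread(target=target) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Stand-in for a real HTTP call (e.g. requests.get against _search):
latencies = run_load_test(lambda q: None, ["q1", "q2"], workers=2)
```

Aggregating min/max/percentiles over the returned latencies would give the "other relevant stats" per worker; the step-2 design would ship those lists back to the central host A.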
Stress testing queries
Are there any decent ES-specific stress testing tools out there that would allow me to test what kinds of simultaneous load my cluster can handle with concurrent users making queries? I searched around a bit and didn't see anything. Figured I'd ask before I come up with a test approach of my own! Thanks
[Hadoop] capability clarification questions
In looking around I haven't been able to find explicit answers to these questions - though that may entirely be because I'm a Hadoop newbie. If we were to deploy ES within a Hadoop environment: The primary benefit is allowing direct interaction with ES from Hadoop, running queries or indexing data - is that right? Are there explicit benefits to search speed and capability when run through the normal REST or other client APIs? That is to say, if I have a set of N documents and a query that takes T seconds to run on a normal cluster through curl, would there be a marked improvement in T when running the same query through curl against a Hadoop-enabled cluster? Are the ideal architecture designs for a Hadoop-enabled ES cluster the same as, or similar to, a "regular" cluster? If they're the same, does a Hadoop-enabled cluster need to be designed as such from the start, or can that functionality be tacked on to an already functioning cluster with data? The situation is, we're on a cluster of machines running Hadoop, but the ES nodes are just running on the compute nodes like a regular service. Wondering what it would take to enable the Hadoop capabilities. Thanks!
Re: Multiple nodes on a powerful system?
Thanks Jörg, Mark and Nikolas, some great information here. The 6x6 configuration was something of a worst case example, the farthest we'd probably stretch it would be 3 nodes per host on 16-18 hosts, which should be a little more reasonable. Hopefully we'll be able to do a support contract with the commercial side of ES and get some help building out a system that meets our exact needs. On Wednesday, January 29, 2014 5:02:05 PM UTC-8, Jörg Prante wrote: > > You should consider the RAM to CPU/Disk ratio. On systems with huge > memory, CPUs have the tendency to become weak, and the I/O subsystem must > push data with higher pressure from RAM to drive (spindle or SSD). > Huge RAM helps for caching strategies but also creates headaches, large > caches must be long lived and must not collapse, which is hard in a large > JVM heap, and JVM garbage collection will take more resources and time. > > Running multiple JVMs on a single host only looks like a viable solution, > but that is not how ES scales. ES scales horizontally over many machines, > not vertically over RAM size. > > So you should take care that your CPU performance is not suffering. There > is overhead also on the OS layer and it depends on the setup. > > A 36 node cluster on 6 machines adds another challenge. You must tell ES > how your nodes are organized, in order to get a reliable green/yellow/red > cluster health for your shard allocation. > > Jörg > >
Multiple nodes on a powerful system?
Other than the resource footprint, is there any reason we should avoid running multiple node instances of a cluster on the same machine, assuming all the shard awareness stuff is in place to keep the copies of a given shard from being stored on nodes that are all resident on a single physical box? Basically, if I had a cluster of, say, six machines with 512GB of RAM apiece, is it reasonable to run six instances of ES per machine with 30GB of heap allocated per instance, resulting in a 36 node cluster with a bit over a terabyte of memory footprint across the cluster?
Scan/scroll facet data?
I've got fields that have a few hundred thousand+ unique values that I'd like to be able to facet on. Is there some way of essentially streaming the exhaustive list of facet results, the way I can with search hits?
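One workaround, assuming there is no server-side way to page through facet results: scan/scroll the matching hits and build the exhaustive value counts client-side. A rough Python sketch over already-fetched scroll pages (the hit structure loosely mirrors the `hits` list of a scroll response; the fetching loop itself is omitted):

```python
from collections import Counter

def stream_value_counts(scroll_pages, field):
    """Aggregate exhaustive per-value counts from scan/scroll result pages.
    scroll_pages yields lists of hit dicts, as a scan/scroll loop would."""
    counts = Counter()
    for page in scroll_pages:
        for hit in page:
            value = hit["_source"].get(field)
            if value is not None:
                counts[value] += 1
    return counts

# Two stubbed scroll pages standing in for real responses:
pages = [
    [{"_source": {"user": "a"}}, {"_source": {"user": "b"}}],
    [{"_source": {"user": "a"}}],
]
counts = stream_value_counts(pages, "user")
```

This trades the facet's server-side aggregation for one full pass over the hits, but memory on the client only grows with the number of unique values, not the number of documents.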
Get counts of new users per day
I'm working with the twitter river data, trying to figure out how to construct a query that would let me generate a count of all new users within a given time period, where in this case "new" means "user has not had a post captured before the start of this query window". So basically, I want to get a facet result of user.screen_name where a name will be dropped from the facet entirely if an instance of it occurs outside a specified time range. I've got no idea where to start. Anyone have any pointers?
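If no single query does it, one fallback is to compute this client-side: pull (screen_name, timestamp) pairs and keep only the users whose earliest captured post falls inside the window. A tiny sketch of that logic, with made-up data:

```python
def new_users(posts, window_start, window_end):
    """Return screen names whose earliest captured post falls inside
    [window_start, window_end], i.e. users never seen before the window.
    posts: iterable of (screen_name, timestamp) pairs."""
    earliest = {}
    for name, ts in posts:
        # Track the earliest timestamp seen per user.
        if name not in earliest or ts < earliest[name]:
            earliest[name] = ts
    return {name for name, ts in earliest.items()
            if window_start <= ts <= window_end}

# alice first appears at t=5 (before the window), bob at t=15 (inside it):
posts = [("alice", 5), ("bob", 15), ("alice", 20)]
result = new_users(posts, 10, 30)
```

The counts per day would then just be the sizes of these sets per daily window. The timestamps here are plain integers for illustration; real tweet dates would need parsing first.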
Cross index filtering
I've got a backlog of usernames. I need to pull data associated with those usernames into a separate index. I want a query that will let me do a terms facet on user counts in the first index, with a facet filter excluding the users that already exist in the second. Is there a simple way to do this? There could end up being a couple hundred million unique users in the secondary index, in theory. For that scale, what's the best way to approach the problem?
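At hundreds of millions of users, a facet filter listing every known user is unlikely to hold up, so one alternative is doing the exclusion client-side: fetch the terms facet from the first index, subtract the users already present in the second, and work through the remainder. A minimal sketch of that subtraction step (the function name and data shapes are illustrative, not any ES API):

```python
def users_to_fetch(source_counts, known_users):
    """Given {user: count} from a terms facet on the source index and the
    set of users already present in the secondary index, return the users
    still to pull, ordered by descending count."""
    pending = {u: c for u, c in source_counts.items()
               if u not in known_users}
    # Highest-volume users first.
    return sorted(pending, key=pending.get, reverse=True)

counts = {"a": 10, "b": 3, "c": 7}
todo = users_to_fetch(counts, {"b"})
```

For the full scale, `known_users` would itself need to be built incrementally (e.g. by scrolling the secondary index), but the subtraction stays the same.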
Design practices for hosting multiple clusters/on-demand cluster creation?
While ES is still in a pre-deployment stage at my job, there is growing interest in it. For various reasons, a monster cluster holding everyone's stuff is simply not possible. Individual projects require complete control over their data, and the culture and security requirements here are such that doing something like always naming project 1's indexes PROJECT_1_ will not fly. We have a fairly beefy Hadoop cluster hosting our content currently, along with a separate head node acting as the master. In this situation, is it simply a matter of starting up new processes on each node pointed at different configuration profiles, and tying specific ports to specific projects/clusters? Basically, is there an established way to build on-demand clusters, given a set of resources? We'll layer something in front of it to deal with access control/etc. Thanks! -Josh
Re: Possible to turn off/suppress version data in response to GET http://localhost:9200?
Yeah, it's been looking like a proxy is the way to go. If it was an already existing functionality allowing me to suppress the version info from /, I'd have been happy to use that, but I agree - it isn't worth anyone's time to add this. Thanks Ivan! -Josh On Thursday, December 19, 2013 3:04:15 PM UTC-8, Ivan Brusic wrote: > > Just having the REST endpoint open is a security risk. :) You can always > put a proxy in front of elasticsearch that intercepts certain calls such as > PUT, POST, DELETE or simply / in your case. > > Normally in elasticsearch, a request is built with various parameters via > a builder and then the resulting response will have the correct fields. You > can see an example with the nodes stats: > > > https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/rest/action/admin/cluster/node/stats/RestNodesStatsAction.java > > The main action does not really have specific request/response classes. > You can try raising an issue or even submitting a pull request yourself, > but I do not see this issue as being very important. That is just my guess. 
> > -- > Ivan > > > On Thu, Dec 19, 2013 at 2:52 PM, Josh Harrison > > wrote: > >> To clarify, when I go to http://localhost:9200, I want to get back >> >> { >> "ok" : true, >> "status" : 200, >> "name" : "Stem Cell", >> "tagline" : "You Know, for Search" >> } >> >> >> Not >> >> { >> "ok" : true, >> "status" : 200, >> "name" : "Stem Cell", >> "version" : { >> "number" : "0.90.5", >> "build_hash" : "c8714e8e0620b62638f660f6144831792b9dedee", >> "build_timestamp" : "2013-09-17T13:09:46Z", >> "build_snapshot" : false, >> "lucene_version" : "4.4" >> }, >> "tagline" : "You Know, for Search" >> } >> >> >> I poked around in the code and the only code place I fine "You Know, for >> Search" is >> >> https://github.com/elasticsearch/elasticsearch/blob/c20d4bb69ed29cf11a747f0fdc40ce4237f79ce4/src/main/java/org/elasticsearch/rest/action/main/RestMainAction.java >> There doesn't appear to be an explicit flag that would allow me to >> suppress that, but perhaps that's somewhere else? My IT folks are in a >> tizzy that version information is being displayed, saying it's a major >> security risk. Sigh. >> Honestly, if it doesn't break something else, I wouldn't mind if there >> was just a way to turn off that default response entirely. That'd do it too. >> >> >> On Thursday, December 19, 2013 12:50:29 PM UTC-8, Ivan Brusic wrote: >> >>> From what I can tell from the code, it appears that you can disable >>> returning the version field. >>> >>> -- >>> Ivan >>> >>> >>> On Thu, Dec 19, 2013 at 12:27 PM, Josh Harrison wrote: >>> >>>> The subject says it all pretty much, is it possible to turn off the >>>> reporting of version data in response to GET http://localhost:9200? >>>> Thanks, >>>> Josh >>>> >>>> -- >>>> You received this message because you are subscribed to the Google >>>> Groups "elasticsearch" group. >>>> To unsubscribe from this group and stop receiving emails from it, send >>>> an email to elasticsearc...@googlegroups.com. 
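The proxy route Ivan suggests could be as small as a response filter that drops the "version" object from the root endpoint's JSON before relaying it to the client. A sketch of just that filtering step (the surrounding proxy plumbing is omitted, and nothing here is an ES feature):

```python
import json

def strip_version(body):
    """Remove the 'version' object from the root-endpoint JSON response
    before relaying it. Leaves every other field untouched."""
    doc = json.loads(body)
    doc.pop("version", None)  # no error if the key is already absent
    return json.dumps(doc)

# The raw root response, as shown earlier in this thread:
raw = ('{"ok": true, "status": 200, "name": "Stem Cell", '
       '"version": {"number": "0.90.5"}, '
       '"tagline": "You Know, for Search"}')
cleaned = json.loads(strip_version(raw))
```

A reverse proxy (nginx, or a tiny WSGI app) would apply this only to `GET /` and pass everything else through untouched, so search and index traffic is unaffected.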
Re: Possible to turn off/suppress version data in response to GET http://localhost:9200?
To clarify, when I go to http://localhost:9200, I want to get back

{
  "ok" : true,
  "status" : 200,
  "name" : "Stem Cell",
  "tagline" : "You Know, for Search"
}

Not

{
  "ok" : true,
  "status" : 200,
  "name" : "Stem Cell",
  "version" : {
    "number" : "0.90.5",
    "build_hash" : "c8714e8e0620b62638f660f6144831792b9dedee",
    "build_timestamp" : "2013-09-17T13:09:46Z",
    "build_snapshot" : false,
    "lucene_version" : "4.4"
  },
  "tagline" : "You Know, for Search"
}

I poked around in the code and the only place I find "You Know, for Search" is https://github.com/elasticsearch/elasticsearch/blob/c20d4bb69ed29cf11a747f0fdc40ce4237f79ce4/src/main/java/org/elasticsearch/rest/action/main/RestMainAction.java There doesn't appear to be an explicit flag that would allow me to suppress that, but perhaps that's somewhere else? My IT folks are in a tizzy that version information is being displayed, saying it's a major security risk. Sigh. Honestly, if it doesn't break something else, I wouldn't mind if there was just a way to turn off that default response entirely. That'd do it too. On Thursday, December 19, 2013 12:50:29 PM UTC-8, Ivan Brusic wrote: > > From what I can tell from the code, it appears that you can disable > returning the version field. > > -- > Ivan > > > On Thu, Dec 19, 2013 at 12:27 PM, Josh Harrison > > wrote: > >> The subject says it all pretty much, is it possible to turn off the >> reporting of version data in response to GET http://localhost:9200? >> Thanks, >> Josh >>
Possible to turn off/suppress version data in response to GET http://localhost:9200?
The subject says it all pretty much, is it possible to turn off the reporting of version data in response to GET http://localhost:9200? Thanks, Josh
Re: ES data on Glusterfs
Cool, thanks. It looks like I've conflated Gluster and Lustre, which are unfortunately totally unrelated. We're running Lustre. On Tuesday, December 17, 2013 5:58:39 PM UTC-8, Jörg Prante wrote: > > Use Gluster on its native protocol, not on NFS and the like. > > If you want backup/restore, Gluster support was announced as a post 1.0 ES > feature. > > Jörg > >
Auto aliases?
So I know how to set up default mappings. Is it possible to set up default aliases for indexes with a certain name format? I'm pulling in data on a weekly basis, so say I've got 2013_41 and all my filtered aliases (based on the contents of the document) set up. When 2013_42 comes along, do I need to manually create the aliases? Or is there a better way? Thanks!
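Absent an automatic mechanism, the weekly alias setup can at least be scripted: build the `_aliases` actions body for each new weekly index and POST it right after the index is created. A sketch of the body builder (the alias names and filters here are placeholders, and the weekly naming scheme is assumed from the question):

```python
def weekly_alias_actions(year, week, filters):
    """Build an _aliases request body adding each filtered alias to the
    new weekly index (named like '2013_42').
    filters: {alias_name: filter_clause}."""
    index = "%d_%02d" % (year, week)
    return {"actions": [
        {"add": {"index": index, "alias": alias, "filter": flt}}
        for alias, flt in sorted(filters.items())
    ]}

# Hypothetical filtered alias keyed on a document field:
body = weekly_alias_actions(2013, 42, {"english": {"term": {"lang": "en"}}})
```

The resulting `body` would be POSTed to the cluster's `_aliases` endpoint (e.g. via the requests module) as part of whatever job creates the weekly index.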
ES data on Glusterfs
Is it a dumb idea to have my ES data directory on a GlusterFS/Lustre storage node? Because we've got some BIG data that we'd like to index, but it's too big for the local storage on our test cluster.