I think some of Twitter's need to index in a particular way comes from their real-time requirements. So that is part of the decision for the original poster: how responsive the data needs to be.
As to the rest, I think the company that shows Twitter messages on TV does something similar with Solr. They were presenting at Revolution 2014 (one before last), I think. I forgot their name (they changed it once or twice...).

Regards,
   Alex.
----
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/

On 4 March 2016 at 06:25, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
> Joseph Obernberger <joseph.obernber...@gmail.com> wrote:
>> Hi All - would it be reasonable to index the Twitter 'firehose' with Solr
>> Cloud - roughly 500-600 million docs per day indexing each of the fields
>> (about 180)?
>
> Possible, yes. Reasonable? It is not going to be cheap.
>
> Twitter index the tweets themselves and have been quite open about how they
> do it. I would suggest looking for their presentations; slides or recordings.
> They have presented at Berlin Buzzwords and Lucene/Solr Revolution and
> probably elsewhere too. The gist is that they have done a lot of work and
> custom coding to handle it.
>
>> If I were to guess at a sharded setup to handle such data, and keep 2 years
>> worth, I would guess about 2500 shards. Is that reasonable?
>
> I think you need to think well beyond standard SolrCloud setups. Even if you
> manage to get 2500 shards running, you will want to do a lot of tweaking on
> the way to issue queries so that each request does not require all 2500
> shards to be searched. Prioritizing newer material and only querying the
> older shards if there are not enough recent results is an example.
>
> I highly doubt that a single SolrCloud is the best answer here. Maybe one
> cloud for each month and a lot of external logic?
>
> - Toke Eskildsen
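
To make the numbers and the "newer material first" idea above a bit more concrete, here is a rough Python sketch. It is only an illustration under assumptions, not how Twitter or anyone in this thread actually does it: the collection names, the `query` callback, and both function names are hypothetical, and a real setup would replace the callback with a request to each monthly collection.

```python
from typing import Callable, Iterable, List


def docs_per_shard(docs_per_day: int, days: int, shards: int) -> float:
    """Average number of documents each shard holds after `days` of indexing."""
    return docs_per_day * days / shards


def search_newest_first(
    collections: Iterable[str],
    query: Callable[[str, int], List[dict]],
    wanted: int,
) -> List[dict]:
    """Query collections in the given (newest-first) order until `wanted` hits exist.

    `query(collection, rows)` is a hypothetical callback returning at most
    `rows` matching documents from one collection.
    """
    hits: List[dict] = []
    for name in collections:
        remaining = wanted - len(hits)
        if remaining <= 0:
            break  # enough recent results; older collections are never touched
        hits.extend(query(name, remaining))
    return hits[:wanted]


if __name__ == "__main__":
    # Midpoint of the 500-600 million docs/day estimate, kept for two years.
    print(f"{docs_per_shard(550_000_000, 730, 2500):,.0f} docs per shard")

    # In-memory stand-ins for "one cloud for each month" (names are made up).
    fake = {
        "tweets_2016_03": [{"id": 1}, {"id": 2}],
        "tweets_2016_02": [{"id": 3}],
        "tweets_2016_01": [{"id": 4}, {"id": 5}],
    }
    touched: List[str] = []

    def fake_query(name: str, rows: int) -> List[dict]:
        touched.append(name)
        return fake[name][:rows]

    print(search_newest_first(fake, fake_query, 3))
    print("collections queried:", touched)
```

With the midpoint figure, 2500 shards would each hold on the order of 160 million documents, and the newest-first loop only reaches back to older months when the recent ones cannot fill the requested number of rows.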