Re: Indexing Twitter - Hypothetical

Alexandre Rafalovitch Thu, 03 Mar 2016 15:36:06 -0800

I think some of Twitter's need to index in a particular way comes
from their real-time requirements. So part of the decision for the
original poster is how responsive the data needs to be.


As to the rest, I think the company that shows Twitter messages on TV
does something similar with Solr. I believe they presented at
Lucene/Solr Revolution 2014 (the one before last). I forgot their name
(they changed it once or twice...)

Regards,
    Alex.
----
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 4 March 2016 at 06:25, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
> Joseph Obernberger <joseph.obernber...@gmail.com> wrote:
>> Hi All - would it be reasonable to index the Twitter 'firehose' with Solr
>> Cloud - roughly 500-600 million docs per day indexing each of the fields
>> (about 180)?
>
> Possible, yes. Reasonable? It is not going to be cheap.
>
> Twitter index the tweets themselves and have been quite open about how they 
> do it. I would suggest looking for their presentations; slides or recordings. 
> They have presented at Berlin Buzzwords and Lucene/Solr Revolution and 
> probably elsewhere too. The gist is that they have done a lot of work and 
> custom coding to handle it.
>
>> If I were to guess at a sharded setup to handle such data, and keep 2 years
>> worth, I would guess about 2500 shards.  Is that reasonable?
>
> I think you need to think well beyond standard SolrCloud setups. Even if you 
> manage to get 2500 shards running, you will want to do a lot of tweaking on 
> the way to issue queries so that each request does not require all 2500 
> shards to be searched. Prioritizing newer material and only querying the older 
> shards if there are not enough recent results is an example.
>
> I highly doubt that a single SolrCloud is the best answer here. Maybe one 
> cloud for each month and a lot of external logic?
>
> - Toke Eskildsen
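
As a rough check on the numbers quoted above, the sketch below works out
what 500-600 million docs per day, kept for 2 years and spread over 2500
shards, means per shard. It uses only the figures from the thread and
assumes even distribution, no replicas and no deletes, so treat the
results as averages rather than a capacity plan.

```python
# Back-of-envelope sizing using only the figures quoted in the thread.
# Assumptions (not from the thread): documents are spread evenly across
# shards, replicas are not counted, and nothing is deleted early.
DOCS_PER_DAY_LOW = 500_000_000
DOCS_PER_DAY_HIGH = 600_000_000
RETENTION_DAYS = 2 * 365
SHARDS = 2500


def docs_per_shard(docs_per_day: int, days: int = RETENTION_DAYS,
                   shards: int = SHARDS) -> int:
    """Average documents per shard once the retention window is full."""
    return docs_per_day * days // shards


def ingest_rate(docs_per_day: int) -> float:
    """Average sustained indexing rate in documents per second."""
    return docs_per_day / 86_400


if __name__ == "__main__":
    for per_day in (DOCS_PER_DAY_LOW, DOCS_PER_DAY_HIGH):
        total = per_day * RETENTION_DAYS
        print(f"{per_day:,} docs/day -> {total:,} docs total, "
              f"{docs_per_shard(per_day):,} docs/shard, "
              f"{ingest_rate(per_day):,.0f} docs/s average")
```

That comes to roughly 146-175 million documents per shard and an average
of about 5,800-6,900 documents per second, before allowing for peaks.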

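The "newer material first, older shards only if needed" idea, combined
with one collection (or cloud) per month, could be sketched as external
logic along these lines. The collection naming scheme and the search_fn
callback are hypothetical placeholders, not a Solr API; search_fn stands
in for whatever client call actually queries one monthly collection. The
approach only makes sense when results are ordered by recency rather
than by relevance score.

```python
from typing import Callable, List

# search_fn(collection, query, limit) -> hits from that collection,
# newest first. Hypothetical stand-in for a real per-collection query.
SearchFn = Callable[[str, str, int], List[dict]]


def monthly_collections(year: int, month: int, months_back: int) -> List[str]:
    """Hypothetical naming scheme: one collection per month, newest first."""
    names = []
    for _ in range(months_back):
        names.append(f"tweets_{year:04d}_{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return names


def search_newest_first(query: str, collections: List[str],
                        search_fn: SearchFn, wanted: int) -> List[dict]:
    """Query collections newest first and stop once enough hits are found."""
    hits: List[dict] = []
    for name in collections:
        remaining = wanted - len(hits)
        if remaining <= 0:
            break  # older collections are never touched
        hits.extend(search_fn(name, query, remaining)[:remaining])
    return hits


if __name__ == "__main__":
    # Tiny in-memory stand-in for the monthly collections.
    fake_index = {
        "tweets_2016_03": [{"id": 1}, {"id": 2}],
        "tweets_2016_02": [{"id": 3}],
        "tweets_2016_01": [{"id": 4}],
    }

    def fake_search(name: str, query: str, limit: int) -> List[dict]:
        return fake_index.get(name, [])[:limit]

    cols = monthly_collections(2016, 3, 3)
    print(search_newest_first("solr", cols, fake_search, wanted=3))
```

With this shape, a request that is satisfied by the current month never
reaches the older collections, which is the point of the suggestion.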