Thank you for the links and explanation.  We are using GATE (General
Architecture for Text Engineering) and parts of the Stanford NER/Parser
on the data that we ingest, but we do not apply it to the queries, only
to the data.  We've been concentrating on the back end and analytics,
not so much on what comes in as queries; that's something we need to
address.  For this hypothetical, I wanted to get ideas on what questions
would need to be asked, and how large the system would need to be.
Thank you all very much for the information so far!
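A minimal sketch of what applying that same extraction to queries might look like. `extract_entities` is a hypothetical stand-in for a GATE/Stanford NER call, and the field names (`person_s`, `location_s`, `text`) are assumptions for illustration, not a real schema:

```python
# Sketch: rewrite a free-text query into fielded Solr clauses using NER.
# extract_entities() is a hypothetical stand-in for a GATE/Stanford NER
# pipeline; person_s / location_s / text are assumed field names.

def extract_entities(query):
    """Return (surface_text, entity_type) pairs found in the query."""
    known = {"paris": "LOCATION", "obama": "PERSON"}  # toy gazetteer
    return [(tok, known[tok.lower()])
            for tok in query.split() if tok.lower() in known]

FIELD_FOR_TYPE = {"PERSON": "person_s", "LOCATION": "location_s"}

def rewrite_query(query):
    clauses, remaining = [], query
    for text, etype in extract_entities(query):
        field = FIELD_FOR_TYPE.get(etype)
        if field:
            clauses.append(f'{field}:"{text}"')
            remaining = remaining.replace(text, "").strip()
    if remaining:  # whatever is left stays a plain full-text clause
        clauses.append(f"text:({remaining})")
    return " AND ".join(clauses)

print(rewrite_query("flights Paris Obama"))
```

The point is only the shape of the rewrite: recognized entities become precise fielded clauses, and the residue falls back to ordinary full-text search.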
Jack - I want to be a guru-level Solr expert.  :)

-Joe

On Sun, Mar 6, 2016 at 1:29 PM, Walter Underwood <wun...@wunderwood.org>
wrote:

> This is a very good presentation on using entity extraction in query
> understanding. As you’ll see from the preso, it is not easy.
>
>
> http://www.slideshare.net/dtunkelang/better-search-through-query-understanding
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Mar 6, 2016, at 7:27 AM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
> >
> > Back to the original question... there are two answers:
> >
> > 1. Yes - for guru-level Solr experts.
> > 2. No - for anybody else.
> >
> > For starters (as always), you would need to do a lot more upfront work
> > on mapping out the forms of query that will be supported. For example,
> > is your focus on precision or recall? Are you looking to analyze all
> > matching tweets or just a sample? What are the load, throughput, and
> > latency requirements? Any spatial search requirements? Any entity
> > search requirements? Without a clear view of the query requirements it
> > simply isn't possible to even begin defining a data model. And without
> > a data model, indexing is a fool's errand. In short: no focus, no
> > progress.
> >
> > -- Jack Krupansky
> >
> > On Sun, Mar 6, 2016 at 7:42 AM, Susheel Kumar <susheel2...@gmail.com>
> wrote:
> >
> >> Entity recognition means recognizing different entities
> >> (names/persons, emails, locations such as city/state/country, etc.)
> >> in your tweets/messages, with the goal of providing more relevant
> >> results to users. NER can be used at query time or at indexing (data
> >> enrichment) time.
> >>
> >> Thanks,
> >> Susheel
> >>
> >> On Fri, Mar 4, 2016 at 7:55 PM, Joseph Obernberger <
> >> joseph.obernber...@gmail.com> wrote:
> >>
> >>> Thank you all very much for all the responses so far.  I've enjoyed
> >>> reading them!  We have noticed that storing data inside of Solr
> >>> results in significantly worse performance (particularly for
> >>> faceting), so we store the values of all the fields elsewhere but
> >>> index all the data with SolrCloud.  I think the suggestion about
> >>> splitting the data up into blocks of date/time is where we would be
> >>> headed: two SolrCloud clusters, one to handle ~30 days of data and
> >>> one to handle historical data.  Another option is to use a single
> >>> SolrCloud cluster but multiple cores/collections.  Either way you'd
> >>> need a job to come through and clean up old data.  The historical
> >>> cluster would have much worse performance, particularly for
> >>> clustering and faceting the data, but that may be acceptable.
> >>> I don't know what you mean by 'entity recognition in the queries' -
> >>> could you elaborate?
> >>>
> >>> We would want to index and potentially facet on any of the fields
> >>> (for example entities_media_url, username, even background color),
> >>> but we do not know a priori which fields will be important to users.
> >>> As to why we would want to make the data searchable: well, I don't
> >>> make the rules!  Tweets are not the only data source, but they are
> >>> certainly the largest we are currently looking at handling.
> >>>
> >>> I will read up on the Berlin Buzzwords talks - thank you for the
> >>> info!
> >>>
> >>> -Joe
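The time-boxed layout described above (collections split by date blocks, plus a job to clean up old data) can be sketched against Solr's Collections API. The naming scheme, shard counts, and host below are assumptions for illustration; only the CREATE/DELETE actions themselves are standard Solr:

```python
# Sketch: monthly tweet collections plus a retention/cleanup job, using
# the standard Solr Collections API (action=CREATE / action=DELETE).
# The tweets_YYYY_MM naming, shard count, and host are assumed values.

from datetime import date

SOLR = "http://localhost:8983/solr/admin/collections"
RETAIN_MONTHS = 24  # e.g. two years of monthly collections

def collection_name(d: date) -> str:
    return f"tweets_{d.year}_{d.month:02d}"

def create_url(d: date) -> str:
    # URL a cron job would hit at the start of each month
    return (f"{SOLR}?action=CREATE&name={collection_name(d)}"
            f"&numShards=100&replicationFactor=2")

def delete_url(d: date) -> str:
    # URL the cleanup job would hit for months past the retention window
    return f"{SOLR}?action=DELETE&name={collection_name(d)}"

print(create_url(date(2016, 3, 1)))
```

A nice side effect of dropping whole collections is that "cleaning up old data" never touches live indexes, unlike delete-by-query inside one big collection.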
> >>>
> >>>
> >>>
> >>> On Fri, Mar 4, 2016 at 9:59 AM, Jack Krupansky <jack.krupan...@gmail.com>
> >>> wrote:
> >>>
> >>>> As always, the initial question is how you intend to query the data -
> >>>> query drives data modeling. How real-time do you need queries to be?
> >>>> How fast do you need archive queries to be? How many fields do you
> >>>> need to query on? How much entity recognition do you need in queries?
> >>>>
> >>>>
> >>>> -- Jack Krupansky
> >>>>
> >>>> On Fri, Mar 4, 2016 at 4:19 AM, Charlie Hull <char...@flax.co.uk>
> >>>> wrote:
> >>>>
> >>>>> On 03/03/2016 19:25, Toke Eskildsen wrote:
> >>>>>
> >>>>>> Joseph Obernberger <joseph.obernber...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Hi All - would it be reasonable to index the Twitter 'firehose'
> >>>>>>> with Solr Cloud - roughly 500-600 million docs per day indexing
> >>>>>>> each of the fields (about 180)?
> >>>>>>>
> >>>>>>
> >>>>>> Possible, yes. Reasonable? It is not going to be cheap.
> >>>>>>
> >>>>>> Twitter index the tweets themselves and have been quite open about
> >>>>>> how they do it. I would suggest looking for their presentations;
> >>>>>> slides or recordings. They have presented at Berlin Buzzwords and
> >>>>>> Lucene/Solr Revolution and probably elsewhere too. The gist is that
> >>>>>> they have done a lot of work and custom coding to handle it.
> >>>>>>
> >>>>>
> >>>>> As I recall they're not using Solr, but rather an in-house layer
> >>>>> built on a customised version of Lucene. They're indexing around
> >>>>> half a trillion tweets.
> >>>>>
> >>>>> If the idea is to provide a searchable archive of all tweets, my
> >>>>> first question would be 'why': if the idea is to monitor new tweets
> >>>>> for particular patterns there are better ways to do this (Luwak,
> >>>>> for example).
> >>>>>
> >>>>> Charlie
> >>>>>
> >>>>>
> >>>>>>> If I were to guess at a sharded setup to handle such data, and
> >>>>>>> keep 2 years' worth, I would guess about 2500 shards.  Is that
> >>>>>>> reasonable?
> >>>>>>>
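As a rough back-of-the-envelope check on that shard guess, using the ~500-600 million docs/day figure from the top of the thread (550M/day and 2500 shards are assumed round numbers):

```python
# Back-of-the-envelope check on the 2500-shard guess. Assumes ~550M
# docs/day (midpoint of the 500-600M figure in the thread) and two
# years of retention; all numbers are illustrative.
docs_per_day = 550_000_000
days = 2 * 365
shards = 2500

total_docs = docs_per_day * days        # total documents over two years
docs_per_shard = total_docs // shards   # average documents per shard

print(f"{total_docs:.3e} total docs, {docs_per_shard:.3e} per shard")
```

That works out to roughly 4e11 documents total, or about 160M per shard, which is at least in the plausible range for a Lucene index, before accounting for replication, field count, or query load.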
> >>>>>>
> >>>>>> I think you need to think well beyond standard SolrCloud setups.
> >>>>>> Even if you manage to get 2500 shards running, you will want to do
> >>>>>> a lot of tweaking in the way you issue queries so that each request
> >>>>>> does not require all 2500 shards to be searched. Prioritizing newer
> >>>>>> material and only querying the older shards if there are not enough
> >>>>>> recent results is an example.
> >>>>>>
> >>>>>> I highly doubt that a single SolrCloud is the best answer here.
> >>>>>> Maybe one cloud for each month and a lot of external logic?
> >>>>>>
> >>>>>> - Toke Eskildsen
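The newest-first fallback Toke describes could be sketched as external routing logic like the following; `search_collection` is a hypothetical stand-in for a real Solr request, and the collection names are assumed:

```python
# Sketch of the newest-first query strategy: search recent collections
# first and only touch older ones when too few hits come back.
# search_collection() is a hypothetical stand-in for a Solr /select call;
# here it returns canned doc ids so the routing logic can be shown.

def search_collection(name, query):
    fake = {"tweets_2016_03": ["d1", "d2"], "tweets_2016_02": ["d3"],
            "tweets_2016_01": ["d4", "d5"]}
    return [d for d in fake.get(name, []) if query]

def tiered_search(collections_newest_first, query, want=3):
    hits = []
    for coll in collections_newest_first:
        hits.extend(search_collection(coll, query))
        if len(hits) >= want:  # stop before touching older shards
            break
    return hits[:want]

print(tiered_search(["tweets_2016_03", "tweets_2016_02",
                     "tweets_2016_01"], "solr"))
```

Most queries would then terminate after one or two recent collections, so the big historical tier only sees traffic when recent data cannot satisfy the request.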
> >>>>>>
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> Charlie Hull
> >>>>> Flax - Open Source Enterprise Search
> >>>>>
> >>>>> tel/fax: +44 (0)8700 118334
> >>>>> mobile:  +44 (0)7767 825828
> >>>>> web: www.flax.co.uk
> >>>>>
> >>>>
> >>>
> >>
>
>
