Thanks, I will have a look.

Mich
Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 7 June 2016 at 13:38, Jörn Franke <jornfra...@gmail.com> wrote:

> Solr is basically an in-memory text index with a lot of capabilities for
> language analysis and extraction (you can compare it to a Google for your
> tweets). The system itself has a lot of features and has a complexity
> similar to big data systems. Its index files can be backed by HDFS. You
> can put the tweets directly into Solr without going via HDFS files.
>
> Carefully decide which fields to index, that is, which fields you want
> to search. It does not make sense to index everything.
>
> On 07 Jun 2016, at 13:51, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> OK, so basically, for predictive off-line analysis (as opposed to
> streaming), in a nutshell one can use Apache Flume to store Twitter data
> in HDFS and use Solr to query the data?
>
> This is what it says:
>
> Solr is a standalone enterprise search server with a REST-like API. You
> put documents in it (called "indexing") via JSON, XML, CSV or binary over
> HTTP. You query it via HTTP GET and receive JSON, XML, CSV or binary
> results.
>
> thanks
>
> On 7 June 2016 at 12:39, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Well, I have seen that the algorithms mentioned are used for this.
>> However, some preprocessing through Solr makes sense: it takes care of
>> synonyms, homonyms, stemming, etc.
>>
>> On 07 Jun 2016, at 13:33, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> Thanks Jorn,
>>
>> To start, I would like to explore how one can turn some of the data into
>> useful information.
>>
>> I would like to look at certain trend analysis. Simple correlation shows
>> that the more a typical topic is mentioned, say "organic food", the more
>> people are inclined to go for it. So one can deduce that organic food is
>> a potential growth area.
>>
>> Now I have all the infrastructure to ingest that data, such as Flume to
>> store it or Spark Streaming to do near-real-time work.
>>
>> Now I want to slice and dice that data for, say, organic food.
>>
>> I presume this is a typical question.
>>
>> You mentioned Spark ml (machine learning?). Is that something viable?
>>
>> Cheers
>>
>> On 7 June 2016 at 12:22, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> Spark ml Support Vector Machines or neural networks could be candidates.
>>> For unsupervised learning it could be clustering.
>>> For doing a graph analysis on the followers you can easily use Spark
>>> GraphX.
>>> Keep in mind that each tweet contains a lot of metadata (location,
>>> followers, etc.) that is more or less structured.
>>> For unstructured text analytics (e.g. the tweet itself) I recommend
>>> Solr/ElasticSearch.
>>>
>>> However, I am not sure what you want to do with the data exactly.
>>>
>>> On 07 Jun 2016, at 13:16, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>> Hi,
>>>
>>> This is really a general question.
>>>
>>> I use Spark to get Twitter data.
>>> I had a quick look at it:
>>>
>>>   import org.apache.spark.streaming.{Seconds, StreamingContext}
>>>   import org.apache.spark.streaming.twitter.TwitterUtils
>>>
>>>   val ssc = new StreamingContext(sparkConf, Seconds(2))
>>>   val tweets = TwitterUtils.createStream(ssc, None)
>>>   val statuses = tweets.map(status => status.getText())
>>>   statuses.print()
>>>
>>> OK.
>>>
>>> I can also use Apache Flume to store the data in an HDFS directory:
>>>
>>>   $FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
>>>
>>> That stores the Twitter data in binary format in an HDFS directory.
>>>
>>> My question is pretty basic.
>>>
>>> What is the best tool/language to dig into that data, for example
>>> Twitter streaming data? I am getting all sorts of stuff coming in. Say I
>>> am only interested in certain topics like sport, etc. How can I detect
>>> the signal from the noise, and using what tool and language?
>>>
>>> Thanks
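
As a rough illustration of the question raised in this thread (keeping only tweets about certain topics), here is a minimal Scala sketch of a plain keyword filter. It does not come from any of the posters, and it is not the Solr or Spark ml approach suggested in the replies. The names TopicFilter, tokens, isOnTopic and topics are hypothetical. It works on plain strings, so it can be tried without Spark.

```scala
// Minimal sketch, assuming a simple phrase match is good enough as a first cut.
// TopicFilter, tokens and isOnTopic are hypothetical names, not from the thread.
object TopicFilter {

  // Lower-case the text and split on anything that is not a letter,
  // so "#Sport today!" becomes Seq("sport", "today").
  def tokens(text: String): Seq[String] =
    text.toLowerCase.split("[^\\p{L}]+").filter(_.nonEmpty).toSeq

  // A text is on-topic if any topic phrase occurs as a run of whole words.
  // Padding with spaces keeps "organic" from matching inside "inorganic".
  def isOnTopic(text: String, topics: Set[String]): Boolean = {
    val normalized = tokens(text).mkString(" ", " ", " ")
    topics.exists { topic =>
      normalized.contains(tokens(topic).mkString(" ", " ", " "))
    }
  }
}

object TopicFilterDemo extends App {
  // Hypothetical topics of interest.
  val topics = Set("organic food", "sport")

  val sample = Seq(
    "I love Organic Food!",
    "inorganic foodstuff is on sale",
    "Great #sport weekend ahead"
  )

  // Keep only the texts that mention one of the topics.
  sample.filter(TopicFilter.isOnTopic(_, topics)).foreach(println)
}
```

Assuming `statuses` is the DStream[String] from the snippet in the original question, the same predicate could be applied as `statuses.filter(text => TopicFilter.isOnTopic(text, topics))`. This only matches literal phrases; the synonym and stemming handling mentioned in the thread is what Solr would add on top.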