What you should also look at is ingestion capacity. If you have a lot of 
irregular writes, such as from sensor data, it can make sense to store them 
first in HBase and flush them regularly to ORC/Parquet Hive tables for analysis.
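
A rough sketch of that buffer-then-flush pattern, in plain Scala only. This is not the HBase or Hive API: `Reading`, `BatchingBuffer`, `maxRows` and `maxAgeMs` are made-up names for illustration, the in-memory buffer merely stands in for the role HBase would play, and the `flush` callback stands in for whatever job appends a batch to the ORC/Parquet table.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical record type; the field names are illustrative only.
final case class Reading(sensorId: String, timestampMs: Long, value: Double)

// Collects irregular writes and hands them to `flush` in batches, either
// when `maxRows` rows have accumulated or when the oldest buffered row is
// at least `maxAgeMs` old. The caller supplies the clock via `nowMs`.
final class BatchingBuffer(maxRows: Int, maxAgeMs: Long, flush: Seq[Reading] => Unit) {
  private val rows = ArrayBuffer.empty[Reading]
  private var oldestMs = 0L

  def write(r: Reading, nowMs: Long): Unit = {
    if (rows.isEmpty) oldestMs = nowMs
    rows += r
    if (rows.size >= maxRows || nowMs - oldestMs >= maxAgeMs) flushNow()
  }

  // Emits the buffered rows as one batch; does nothing if the buffer is empty.
  def flushNow(): Unit = if (rows.nonEmpty) {
    flush(rows.toList)
    rows.clear()
  }
}
```

The point of the two thresholds is that many small writes end up as a few large appends, which is what columnar formats such as ORC and Parquet are better suited to.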

> On 08 Jun 2016, at 13:15, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> Interesting. There is also Apache NiFi.
> 
> I also note that one can store Twitter data in Hive tables as well?
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 7 June 2016 at 15:59, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>> Thanks, I will have a look.
>> 
>> Mich
>> 
>> 
>>> On 7 June 2016 at 13:38, Jörn Franke <jornfra...@gmail.com> wrote:
>>> Solr is basically an in-memory text index with a lot of capabilities for 
>>> language analysis and extraction (you can compare it to a Google for your 
>>> tweets). The system itself has a lot of features and a complexity 
>>> similar to big data systems. The index files can be backed by HDFS. You 
>>> can put the tweets directly into Solr without going via HDFS files.
>>> 
>>> Carefully decide which fields you want to index and search. It does not 
>>> make sense to index everything.
>>> 
>>>> On 07 Jun 2016, at 13:51, Mich Talebzadeh <mich.talebza...@gmail.com> 
>>>> wrote:
>>>> 
>>>> OK. So basically, for predictive offline analysis (as opposed to 
>>>> streaming), in a nutshell one can use Apache Flume to store Twitter data 
>>>> in HDFS and use Solr to query the data?
>>>> 
>>>> This is what it says:
>>>> 
>>>> Solr is a standalone enterprise search server with a REST-like API. You 
>>>> put documents in it (called "indexing") via JSON, XML, CSV or binary over 
>>>> HTTP. You query it via HTTP GET and receive JSON, XML, CSV or binary 
>>>> results.
>>>> 
>>>> thanks
>>>> 
>>>> 
>>>>> On 7 June 2016 at 12:39, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>> Well, I have seen that the algorithms mentioned are used for this. However, 
>>>>> some preprocessing through Solr makes sense: it takes care of synonyms, 
>>>>> homonyms, stemming, etc.
>>>>> 
>>>>>> On 07 Jun 2016, at 13:33, Mich Talebzadeh <mich.talebza...@gmail.com> 
>>>>>> wrote:
>>>>>> 
>>>>>> Thanks Jorn,
>>>>>> 
>>>>>> To start, I would like to explore how one can turn some of the data into 
>>>>>> useful information.
>>>>>> 
>>>>>> I would like to look at some trend analysis. Simple correlation shows 
>>>>>> that the more a typical topic is mentioned, say "organic food", the 
>>>>>> more people are inclined to go for it. So one can deduce that organic 
>>>>>> food is a potential growth area.
>>>>>> 
>>>>>> Now I have all the infrastructure to ingest that data, such as using 
>>>>>> Flume to store it or Spark Streaming to do near-real-time work.
>>>>>> 
>>>>>> Now I want to slice and dice that data for say organic food.
>>>>>> 
>>>>>> I presume this is a typical question.
>>>>>> 
>>>>>> You mentioned Spark ML (machine learning?). Is that something viable?
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> 
>>>>>>> On 7 June 2016 at 12:22, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>> Spark ML support vector machines or neural networks could be 
>>>>>>> candidates. 
>>>>>>> For unsupervised learning it could be clustering.
>>>>>>> For graph analysis on the followers you can easily use Spark 
>>>>>>> GraphX.
>>>>>>> Keep in mind that each tweet contains a lot of metadata (location, 
>>>>>>> followers, etc.) that is more or less structured.
>>>>>>> For unstructured text analytics (e.g. the tweet itself) I recommend 
>>>>>>> Solr/Elasticsearch.
>>>>>>> 
>>>>>>> However I am not sure what you want to do with the data exactly.
>>>>>>> 
>>>>>>> 
>>>>>>>> On 07 Jun 2016, at 13:16, Mich Talebzadeh <mich.talebza...@gmail.com> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> This is really a general question.
>>>>>>>> 
>>>>>>>> I use Spark to get Twitter data. I had a look at it:
>>>>>>>> 
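>>>>>>>>     import org.apache.spark.streaming.{Seconds, StreamingContext}
>>>>>>>>     import org.apache.spark.streaming.twitter.TwitterUtils
>>>>>>>>     val ssc = new StreamingContext(sparkConf, Seconds(2)) // sparkConf defined elsewhere
>>>>>>>>     val tweets = TwitterUtils.createStream(ssc, None)
>>>>>>>>     val statuses = tweets.map(status => status.getText())
>>>>>>>>     statuses.print()
>>>>>>>>     ssc.start()            // nothing runs until the context is started
>>>>>>>>     ssc.awaitTermination()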
>>>>>>>> 
>>>>>>>> Ok
>>>>>>>> 
>>>>>>>> Also, I can use Apache Flume to store the data in an HDFS directory:
>>>>>>>> 
>>>>>>>> $FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf 
>>>>>>>> -Dflume.root.logger=DEBUG,console -n TwitterAgent
>>>>>>>> 
>>>>>>>> That stores the Twitter data in binary format in the HDFS directory.
>>>>>>>> 
>>>>>>>> My question is pretty basic.
>>>>>>>> 
>>>>>>>> What is the best tool/language to dig into that data, for example 
>>>>>>>> Twitter streaming data? I am getting all sorts of stuff coming in. Say 
>>>>>>>> I am only interested in certain topics like sport etc. How can I 
>>>>>>>> detect the signal from the noise, and with what tool and language?
>>>>>>>> 
>>>>>>>> Thanks
> 
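
On the original question at the bottom of the thread (keeping only the tweets on certain topics): a minimal sketch, assuming plain phrase matching is good enough as a first pass. `TopicFilter`, `tokens` and `mentions` are hypothetical names, and this does none of the synonym or stemming handling that Solr would give you; it only lower-cases, splits into words and looks for the phrase.

```scala
object TopicFilter {
  // Lower-case the text and split on anything that is not a letter or digit.
  def tokens(text: String): Seq[String] =
    text.toLowerCase.split("[^\\p{L}\\p{Nd}]+").filter(_.nonEmpty).toSeq

  // True if the words of at least one topic phrase occur consecutively,
  // in order, in the text. Empty phrases never match.
  def mentions(text: String, phrases: Seq[String]): Boolean = {
    val words = tokens(text)
    phrases.exists { p =>
      val target = tokens(p)
      target.nonEmpty && words.containsSlice(target)
    }
  }
}
```

With the stream from the first message, something like `statuses.filter(t => TopicFilter.mentions(t, Seq("organic food")))` should keep only the matching tweets, since a DStream supports `filter`. Counting those per time window would be one simple way to start on the trend question.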
