Re: Analyzing twitter data

2016-06-08 Thread Jörn Franke
You can load it directly into Solr.
But first think about which fields you want to index.



Re: Analyzing twitter data

2016-06-08 Thread Mich Talebzadeh
Yes, using that is reasonable.

What is the format of the Twitter data? Is it primarily JSON?

If I do

duser@rhes564: /usr/lib/nifi-0.6.1/conf> hdfs dfs -cat /twitter_data/FlumeData.1464945101915 | more

16/06/08 14:48:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
{"type":"record","name":"Doc","doc":"adoc","fields":[
  {"name":"id","type":"string"},
  {"name":"user_friends_count","type":["int","null"]},
  {"name":"user_location","type":["string","null"]},
  {"name":"user_description","type":["string","null"]},
  {"name":"user_statuses_count","type":["int","null"]},
  {"name":"user_followers_count","type":["int","null"]},
  {"name":"user_name","type":["string","null"]},
  {"name":"user_screen_name","type":["string","null"]},
  {"name":"created_at","type":["string","null"]},
  {"name":"text","type":["string","null"]},
  {"name":"retweet_count","type":["long","null"]},
  {"name":"retweeted","type":["boolean","null"]},
  {"name":"in_reply_to_user_id","type":["long","null"]},
  {"name":"source","type":["string","null"]},
  {"name":"in_reply_to_status_id","type":["long","null"]},
  {"name":"media_url_https","type":["string","null"]},
  {"name":"expanded_url","type":["string","null"]}
]}
▒ぷろが説明書✋http://twpf.jp/960_krm
8659292711026688
▒便座カバー
男児「こちらロストボーイ1、"施設"に侵入した。セキュリティが厄介で数日かかるな。偽装工作はうまくいっているか?」
父親「あぁ今じゃ立派なダメ親父と可哀想な子供扱いだ。これで万一見つかってもお前は安心さ」
男…
ter.com" rel="nofollow">Twitter Web Clients, and fun!
Learning a new la... https://t.co/ejHfRcAucy
3-2 7番 代議員
木村亮太
知ってる人はRT
ちびまる子ちゃんのEDですね!
この曲は12年前になります。 https://t.co/LfLZ8xX5u9
twitter.com/hijidora/status/737980634858029056/video/1$738659292677431296
(ライトニングさんティナラムザアダマンA)/かんこれ/EXVS(モチベ↑雑魚後衛)/FGOその他適当
▒http://twitter.com/download/iphone; rel="nofollow">Twitter for
iPhone
itter.com/download/android" rel="nofollow">Twitter for Android Free
Lyft credit with Lyft promo code LYFTLUSHpp.com" rel="nofollow">Buffer
naliar for iPad
third person
男子南ことりが大好きなラブライバーです! ラブライブ大好きな人ぜひフォローしてください
固定ツイートお願いします
ラブライブに出会えて良かった!
9人のみんなのこと忘れない
#LoveLiveforever
#ラブライバーと繋がりたいRT https://t.co/kITPDLER9x
07114803986434/photo/1$738659292685979648
:13Z://pbs.twimg.com/media/CkA-exTWYAAK8TU.jpg
: 1000RT:【資金不足】「学園ハンサム」、クラウドファンディングでアニメ化支援を募集
https://t.co/CVM2F7rNt1
放送局やキャストは「支援額に応じて変わる」とのこと。時期は10月から1クールと発表されている。 http…
com/media/CkAftVyUYAA0nmn.jpg-06-03T10:11:13Z
miga, sutiã que é do dia a dia ela só usa de ser obrigada, acha mesmo q ela
compraria mais d…r Promoter | Worked with @inkmonstarr @breadboi @ayookd
@chapobandz and more | PayPal accepted | DM for beats | Beats Starting at
$10 |resenting August Redmoon at the Hollywood premiere of Inside Metal:
The Metal Scene Explodes! 落落落落 https://t.…
.jpg

-03T10:11:13Z

I assume it is all JSON data. So can I use Solr to build an index on these
files and search them?

Or, alternatively, use it as a staging area for a Hive table?
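
A note on the format: the header printed above is an Avro schema (a JSON object of type "record" with a "fields" list) followed by binary-looking data, which suggests the Flume files are Avro container files rather than plain JSON. Below is a minimal Scala sketch, assuming you can get hold of the first few bytes of a file, that checks for the Avro container magic bytes. The object and method names are made up for illustration.

```scala
object AvroSniffer {
  // An Avro object container file starts with ASCII 'O', 'b', 'j' followed by the byte 1.
  private val AvroMagic: Array[Byte] = "Obj".getBytes("US-ASCII") :+ 1.toByte

  /** True when the leading bytes match the Avro container magic. */
  def looksLikeAvro(head: Array[Byte]): Boolean =
    head.length >= AvroMagic.length &&
      java.util.Arrays.equals(head.take(AvroMagic.length), AvroMagic)

  /** Very rough check for text JSON: the first non-whitespace byte is '{' or '['. */
  def looksLikeJson(head: Array[Byte]): Boolean =
    head.dropWhile(b => b.toChar.isWhitespace).headOption
      .exists(b => b == '{'.toByte || b == '['.toByte)
}
```

If the check says Avro, the files would need an Avro-aware reader before the records can be indexed or loaded into a table.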


Thanks


Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com



On 8 June 2016 at 14:01, Jörn Franke  wrote:

> I mean, what you should also look at is ingestion capacity. If you have a
> lot of irregular writes, such as from sensor data, it can make sense to
> store them first in HBase and flush them regularly to ORC/Parquet Hive
> tables for analysis.
>
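
The buffer-then-flush idea in the quoted advice can be sketched in plain Scala. This is only an in-memory stand-in, assuming a made-up BatchingBuffer class and a caller-supplied flush function; it does not talk to HBase or Hive, it just shows irregular single writes being handed on as regular batches.

```scala
import scala.collection.mutable.ArrayBuffer

/** Collects single writes and hands them on in batches of `batchSize`. */
final class BatchingBuffer[A](batchSize: Int)(flush: Seq[A] => Unit) {
  private val pending = ArrayBuffer.empty[A]

  /** Accepts one record and flushes once the batch is full. */
  def add(record: A): Unit = {
    pending += record
    if (pending.size >= batchSize) flushNow()
  }

  /** Pushes whatever is pending to the sink, e.g. on a timer or at shutdown. */
  def flushNow(): Unit =
    if (pending.nonEmpty) {
      flush(pending.toList) // copy, so the sink never sees later mutations
      pending.clear()
    }
}
```

In a real pipeline the flush function would be where a batch is written out to a columnar table.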

Re: Analyzing twitter data

2016-06-08 Thread Jörn Franke
That is trivial to do; I did it once when the tweets were in JSON format.


Re: Analyzing twitter data

2016-06-08 Thread Mich Talebzadeh
Interesting. There is also Apache NiFi.

I also note that one can store Twitter data in Hive tables as well?







Re: Analyzing twitter data

2016-06-07 Thread Mich Talebzadeh
Thanks, I will have a look.

Mich





Re: Analyzing twitter data

2016-06-07 Thread Jörn Franke
Solr is basically an in-memory text index with a lot of capabilities for
language analysis and extraction (you can think of it as a Google for your
tweets). The system itself has a lot of features and is similar in complexity
to big data systems. Its index files can be backed by HDFS. You can put the
tweets directly into Solr without going via HDFS files.

Carefully decide which fields you want to index and search. It does not make
sense to index everything.
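
To make the point about choosing fields concrete, here is a minimal Scala sketch that keeps only a few tweet fields and renders them as a JSON document of the kind a search index could ingest. The chosen field list and the SolrDocSketch name are assumptions for illustration, and the escaping is deliberately simple.

```scala
object SolrDocSketch {
  // Hypothetical subset of the tweet fields worth searching; adjust to your own needs.
  val indexedFields: Seq[String] = Seq("id", "text", "user_screen_name", "created_at")

  /** Escapes the characters that must not appear raw inside a JSON string. */
  def escape(s: String): String = s.flatMap {
    case '"'          => "\\\""
    case '\\'         => "\\\\"
    case '\n'         => "\\n"
    case '\r'         => "\\r"
    case '\t'         => "\\t"
    case c if c < ' ' => " " // replace other control characters with a space
    case c            => c.toString
  }

  /** Renders only the selected fields of one tweet as a JSON object. */
  def toSolrJson(tweet: Map[String, String]): String =
    indexedFields
      .flatMap(name => tweet.get(name).map(v => "\"" + name + "\":\"" + escape(v) + "\""))
      .mkString("{", ",", "}")
}
```

Fields that are not in the list (source, media URLs and so on) simply never reach the index, which keeps it smaller and the searches more focused.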


Re: Analyzing twitter data

2016-06-07 Thread Mich Talebzadeh
OK, so basically, for predictive offline analysis (as opposed to streaming),
one can in a nutshell use Apache Flume to store Twitter data in HDFS and use
Solr to query the data?

This is what it says about Solr:

Solr is a standalone enterprise search server with a REST-like API. You put
documents in it (called "indexing") via JSON, XML, CSV or binary over HTTP.
You query it via HTTP GET and receive JSON, XML, CSV or binary results.
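
As a rough illustration of the HTTP GET side, the Scala sketch below only builds a query URL for Solr's select handler. The host, port and the collection name "tweets" are placeholders, not details from this thread.

```scala
import java.net.URLEncoder

object SolrQuerySketch {
  /** Builds a select URL; baseUrl and collection are placeholders for your own setup. */
  def selectUrl(baseUrl: String, collection: String, query: String, rows: Int = 10): String = {
    val q = URLEncoder.encode(query, "UTF-8") // percent-encode the query text
    s"$baseUrl/solr/$collection/select?q=$q&rows=$rows&wt=json"
  }
}
```

For example, SolrQuerySketch.selectUrl("http://localhost:8983", "tweets", "text:\"organic food\"") yields a URL that could be fetched with any HTTP client, with the results coming back as JSON.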

Thanks





Re: Analyzing twitter data

2016-06-07 Thread Jörn Franke
Well, I have seen the algorithms mentioned used for this. However, some
preprocessing through Solr makes sense: it takes care of synonyms, homonyms,
stemming, etc.
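
To show what that kind of preprocessing does to text, here is a toy Scala sketch of lowercasing, synonym mapping and very crude suffix stripping. It is not how Solr's analyzers are implemented; the synonym table and the ToyAnalyzer name are invented for illustration.

```scala
object ToyAnalyzer {
  // Hypothetical synonym table; a real deployment would configure this in the search engine.
  val synonyms: Map[String, String] = Map("soccer" -> "football", "bio" -> "organic")

  /** Crude suffix stripping, far simpler than a real stemmer. */
  def stem(token: String): String =
    if (token.length > 4 && token.endsWith("ing")) token.dropRight(3)
    else if (token.length > 3 && token.endsWith("s")) token.dropRight(1)
    else token

  /** Lowercases, splits on non-letters, maps synonyms, then stems. */
  def analyze(text: String): List[String] =
    text.toLowerCase.split("[^\\p{L}]+").toList.filter(_.nonEmpty)
      .map(t => synonyms.getOrElse(t, t))
      .map(stem)
}
```

After this step "BIO foods" and "organic food" reduce to the same tokens, which is what makes later counting or classification less noisy.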

>> On 7 June 2016 at 12:22, Jörn Franke  wrote:
>> Spark ML support vector machines or neural networks could be candidates.
>> For unsupervised learning it could be clustering.
>> For graph analysis on the followers you can easily use Spark GraphX.
>> Keep in mind that each tweet contains a lot of metadata (location,
>> followers, etc.) that is more or less structured.
>> For unstructured text analytics (e.g. the tweet itself) I recommend
>> Solr/Elasticsearch.
>> 
>> However, I am not sure what exactly you want to do with the data.
>> 
>> 
>>> On 07 Jun 2016, at 13:16, Mich Talebzadeh  wrote:
>>> 
>>> Hi,
>>> 
>>> This is really a general question.
>>> 
>>> I use Spark to get twitter data. I did some looking at it
>>> 
>>> val ssc = new StreamingContext(sparkConf, Seconds(2))
>>> val tweets = TwitterUtils.createStream(ssc, None)
>>> val statuses = tweets.map(status => status.getText())
>>> statuses.print()
>>> 
>>> Ok
>>> 
>>> Also I can use Apache flume to store data in hdfs directory
>>> 
>>> $FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf 
>>> Dflume.root.logger=DEBUG,console -n TwitterAgent
>>> Now that stores twitter data in binary format in  hdfs directory.
>>> 
>>> My question is pretty basic.
>>> 
>>> What is the best tool/language to dif in to that data. For example twitter 
>>> streaming data. I am getting all sorts od stuff coming in. Say I am only 
>>> interested in certain topics like sport etc. How can I detect the signal 
>>> from the noise using what tool and language?
>>> 
>>> Thanks
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
> 


Re: Analyzing twitter data

2016-06-07 Thread Mich Talebzadeh
Thanks Jorn,

To start, I would like to explore how one can turn some of the data into
useful information.

I would like to look at some trend analysis. Simple correlation suggests
that the more a given topic, say "organic food", is mentioned, the more
people are inclined to go for it. From that one could deduce that organic
food is a potential growth area.

I already have all the infrastructure to ingest that data, such as Flume to
store it or Spark Streaming to do near-real-time work.

Now I want to slice and dice that data for, say, organic food.

I presume this is a typical question.

You mentioned Spark ML (machine learning?). Is that something viable?
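
As a rough illustration of the "slice and dice" step, the sketch below
counts how many tweets mention a phrase per day, which is the raw series a
trend analysis would start from. It is plain Scala with no Spark dependency;
the Tweet case class and its day and text fields are assumptions made for
the example, and a rising count on its own says nothing yet about demand.

```scala
object TopicTrend {
  // Simplified tweet: a day label and the text. Both fields are assumptions.
  final case class Tweet(day: String, text: String)

  // Count, per day, the tweets whose text contains the phrase (case-insensitive).
  def mentionsPerDay(tweets: Seq[Tweet], phrase: String): Map[String, Int] = {
    val needle = phrase.toLowerCase
    tweets
      .filter(_.text.toLowerCase.contains(needle))
      .groupBy(_.day)
      .map { case (day, hits) => day -> hits.size }
  }
}
```

Calling TopicTrend.mentionsPerDay(tweets, "organic food") gives one count
per day that can then be plotted or compared with other data.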

Cheers





Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com



On 7 June 2016 at 12:22, Jörn Franke  wrote:

> Spark ML support vector machines or neural networks could be candidates.
> For unsupervised learning, clustering could be an option.
> For a graph analysis of the followers you can easily use Spark GraphX.
> Keep in mind that each tweet contains a lot of metadata (location,
> followers, etc.) that is more or less structured.
> For unstructured text analytics (e.g. the tweet itself) I recommend
> Solr/Elasticsearch.
>
> However I am not sure what you want to do with the data exactly.


Re: Analyzing twitter data

2016-06-07 Thread Jörn Franke
Spark ML support vector machines or neural networks could be candidates. 
For unsupervised learning, clustering could be an option.
For a graph analysis of the followers you can easily use Spark GraphX.
Keep in mind that each tweet contains a lot of metadata (location, followers, 
etc.) that is more or less structured.
For unstructured text analytics (e.g. the tweet itself) I recommend 
Solr/Elasticsearch.

However I am not sure what you want to do with the data exactly.
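
To make the structured/unstructured split concrete, here is a small 
plain-Scala sketch. The field names follow a subset of the Avro schema of the 
Flume output shown elsewhere in this thread (id, user_location, 
user_followers_count, text); the case class and the helper names are 
assumptions made for illustration.

```scala
object TweetMetadata {
  // Subset of the fields from the Flume/Avro schema quoted in this thread.
  // Option models the nullable (["type","null"]) fields.
  final case class Tweet(
      id: String,
      userLocation: Option[String],
      userFollowersCount: Option[Int],
      text: Option[String])

  // Structured side: number of tweets per user location, skipping missing ones.
  def tweetsByLocation(tweets: Seq[Tweet]): Map[String, Int] =
    tweets
      .flatMap(_.userLocation)
      .groupBy(identity)
      .map { case (location, group) => location -> group.size }

  // Unstructured side: only the free text, e.g. to hand to a search engine.
  def texts(tweets: Seq[Tweet]): Seq[String] =
    tweets.flatMap(_.text)
}
```

The metadata can be aggregated directly like any table, while the text 
column is the part that benefits from a search/text engine.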


> On 07 Jun 2016, at 13:16, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> This is really a general question.
> 
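> I use Spark to get Twitter data and have had a first look at it:
> 
> import org.apache.spark.streaming.{Seconds, StreamingContext}
> import org.apache.spark.streaming.twitter.TwitterUtils
> val ssc = new StreamingContext(sparkConf, Seconds(2))
> val tweets = TwitterUtils.createStream(ssc, None)
> val statuses = tweets.map(status => status.getText())
> statuses.print()
> ssc.start()
> ssc.awaitTermination()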
> 
> Ok
> 
> I can also use Apache Flume to store the data in an HDFS directory:
> 
> $FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf \
>   -Dflume.root.logger=DEBUG,console -n TwitterAgent
> 
> That stores the Twitter data in binary format in the HDFS directory.
> 
> My question is pretty basic.
> 
> What is the best tool/language to dig into that data, for example Twitter 
> streaming data? I am getting all sorts of stuff coming in. Say I am only 
> interested in certain topics such as sport. How can I separate the signal 
> from the noise, and with what tool and language?
> 
> Thanks


Analyzing twitter data

2016-06-07 Thread Mich Talebzadeh
Hi,

This is really a general question.

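I use Spark to get Twitter data and have had a first look at it:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils
val ssc = new StreamingContext(sparkConf, Seconds(2))
val tweets = TwitterUtils.createStream(ssc, None)
val statuses = tweets.map(status => status.getText())
statuses.print()
ssc.start()
ssc.awaitTermination()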

Ok

I can also use Apache Flume to store the data in an HDFS directory:

$FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf \
  -Dflume.root.logger=DEBUG,console -n TwitterAgent

That stores the Twitter data in binary format in the HDFS directory.

My question is pretty basic.

What is the best tool/language to dig into that data, for example Twitter
streaming data? I am getting all sorts of stuff coming in. Say I am only
interested in certain topics such as sport. How can I separate the signal
from the noise, and with what tool and language?
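
One simple starting point for separating topic-relevant tweets from the
rest is plain keyword matching. The sketch below is a minimal illustration
in plain Scala, not a recommendation of a particular tool: the keyword list
is hypothetical, and the ASCII-only word split means it only suits English
keywords.

```scala
object TopicFilter {
  // Hypothetical keyword list for the "sport" topic; tune for real use.
  val sportKeywords: Set[String] =
    Set("football", "tennis", "cricket", "olympics")

  // True when any word of the tweet equals one of the keywords
  // (case-insensitive; words are split on non-word characters).
  def isAbout(text: String, keywords: Set[String]): Boolean =
    text.toLowerCase.split("\\W+").exists(keywords.contains)
}
```

With the streaming code above, this could be applied as
statuses.filter(text => TopicFilter.isAbout(text, TopicFilter.sportKeywords))
before printing or storing the tweets.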

Thanks

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com