Re: Twitter streaming with apache spark stream only a small amount of tweets

2015-07-29 Thread Peyman Mohajerian
This question was answered with sample code a couple of days ago; please
look back.


Re: Twitter streaming with apache spark stream only a small amount of tweets

2015-07-29 Thread Zoran Jeremic
Can you send me the subject of that email? I can't find any email
suggesting a solution to that problem. There is the email *Twitter4j
streaming question*, but it doesn't have any sample code. It just confirms
what I explained earlier: without filtering, Twitter limits the stream to
1% of tweets, and with the filter API, Twitter limits you to 400 hashtags.



Zoran Jeremic, PhD
Senior System Analyst  Programmer

Athabasca University
Tel: +1 604 92 89 944

Re: Twitter streaming with apache spark stream only a small amount of tweets

2015-07-29 Thread Enno Shioji
If you start parallel Twitter streams, you will be in breach of their TOS.
They allow a small number of parallel streams in practice, but if you do it
on a massive scale they'll ban you (I'm speaking from experience ;) ).

If you really need that level of data, you need to talk to a company called
Gnip - AFAIK they are the sole reseller now. It's not cheap though.


Re: Twitter streaming with apache spark stream only a small amount of tweets

2015-07-29 Thread Peyman Mohajerian
The subject was 'How to restart Twitter spark stream'.
It may not be exactly what you are looking for, but I thought it did touch
on some aspect of your question.



Re: Twitter streaming with apache spark stream only a small amount of tweets

2015-07-29 Thread Zoran Jeremic
Actually, I posted that question :)
I already implemented the solution that Akhil suggested there, and that
solution uses the sample tweets API, which returns only 1% of the tweets.
It would not work in my use case. For the hashtags I'm interested in, I
need to catch every single tweet, not only some of them.
So for me, only the Twitter filter API would work, but as I already wrote,
there is another problem: Twitter limits the filter to a maximum of 400
hashtags. That means I need several parallel Twitter streams in order to
follow more hashtags.
That was the problem I could not solve with Spark Twitter streaming: I
could not start parallel streams. The other problem is that I need to add
and remove hashtags from the running streams, that is, I need to clean up
the stream and initialize the filter again. I managed to implement this
with twitter4j directly, but not with spark-twitter streaming.



Re: Twitter streaming with apache spark stream only a small amount of tweets

2015-07-26 Thread Zoran Jeremic

I discovered what the problem is here. The Twitter public stream is limited
to 1% of overall tweets, so that's why I can't access all the tweets posted
with a specific hashtag using the approach that I posted in my previous
email, so I guess this approach will not work for me. The other problem is
that filtering has a limit of 400 hashtags, so in order to follow more than
400 hashtags I need more parallel streams.

This brings me back to my previous question. In my application I need to
follow more than 400 hashtags, and I need to collect every tweet having one
of these hashtags. Another complication is that users can add new hashtags
or remove old ones, so I have to update the stream in real time.
My earlier approach without Apache Spark was to create a twitter4j user
stream with an initial filter, and each time a new hashtag had to be added,
stop the stream, add the new hashtag and run it again. When a stream reached
400 hashtags, I initialized a new stream with new credentials. This was
really complex, and I was hoping that Apache Spark would make it simpler.
However, I have been trying for days to find a solution, with no success.

If I have to use the same approach I used with twitter4j, I have to solve 2
problems:
- how to run multiple twitter streams in the same spark context
- how to add new hashtags to the existing filter

I hope that somebody will have a more elegant solution or idea, and
tell me that I missed something obvious.
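
For the first of the two problems, a minimal sketch of how the hashtag list could be split into groups that each fit under the 400-hashtag limit mentioned above. The names `HashtagBatches`, `batches` and `MaxTagsPerStream` are hypothetical helpers made up for illustration, not part of Spark or twitter4j, and the figure 400 is simply the limit quoted in this thread. The comment at the end assumes the `TwitterUtils.createStream(ssc, auth, filters)` overload of spark-streaming-twitter and `StreamingContext.union`; whether Twitter's terms allow several connections is a separate question.

```scala
// Hypothetical helper for illustration; not part of Spark or twitter4j.
object HashtagBatches {
  // Per-stream limit as quoted in this thread; verify against Twitter's current documentation.
  val MaxTagsPerStream = 400

  // Trim and lower-case each tag, drop empties and duplicates,
  // then split the result into groups of at most `limit` tags.
  def batches(tags: Seq[String], limit: Int = MaxTagsPerStream): Seq[Seq[String]] =
    tags.map(_.trim.toLowerCase).filter(_.nonEmpty).distinct.grouped(limit).toList
}

// With Spark on the classpath, each group could then back one filtered stream, roughly:
//   val streams = HashtagBatches.batches(allTags).map(b => TwitterUtils.createStream(ssc, auth, b))
//   val all     = ssc.union(streams)
// (`allTags`, `ssc` and `auth` are assumed to exist in the application.)
```

Each receiver-based stream typically occupies one core, so the cluster would need at least one more core than there are groups. The earlier twitter4j approach used separate credentials per stream; the sketch does not address that.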



Twitter streaming with apache spark stream only a small amount of tweets

2015-07-25 Thread Zoran Jeremic

I've implemented Twitter streaming as in the code given at the bottom of
this email. It finds some tweets based on the hashtags I'm following.
However, it seems that a large number of tweets are missing. I've tried
posting some tweets with hashtags that I'm following in the application,
and none of them was received by the application. I also checked some
hashtags (e.g. #android) on Twitter using Live and I could see that
something was posted with that hashtag almost every second, while my
application received only 3-4 posts per minute.

I didn't have this problem in the earlier non-Spark version of the
application, which used twitter4j to access the user stream API. I guess
this is some trending stream, but I couldn't find anything that explains
which Twitter API is used in Spark Twitter Streaming and how to create a
stream that will access everything posted on Twitter.

I hope somebody can explain what the problem is and how to solve it.


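 def initializeStreaming(): Unit = {
   val config = // (rest of this line is missing from the archived message)
   val auth: Option[twitter4j.auth.Authorization] = Some(new // (truncated in the archive)
   val stream: DStream[Status] = TwitterUtils.createStream(ssc, auth)
   // Keep only statuses whose text contains one of the followed hashtags.
   val filtered_statuses = stream.transform(rdd => {
     val filtered = rdd.filter(status => {
       var found = false
       for (tag <- hashTagsList) {
         if (status.getText.toLowerCase.contains(tag)) {
           found = true
         }
       }
       found
     })
     filtered
   })
   filtered_statuses.foreachRDD(rdd => {
     rdd.collect.foreach(t => {
       // (body missing from the archived message)
     })
   })
 }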
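
The matching loop in the code above can be written more compactly. This is only a sketch: `HashtagMatch` and `matchesAny` are hypothetical names, and the check is the same plain substring test as in the original code, so `#android` would also match `#androidstudio`.

```scala
// Hypothetical helper for illustration; not part of Spark or twitter4j.
object HashtagMatch {
  // True when the text contains at least one of the tags, ignoring case.
  def matchesAny(text: String, tags: Seq[String]): Boolean = {
    val lower = text.toLowerCase
    tags.exists(tag => lower.contains(tag.toLowerCase))
  }
}
```

In the streaming code this could be used as `stream.filter(status => HashtagMatch.matchesAny(status.getText, hashTagsList))`. Note that this filtering runs on whatever tweets the stream has already delivered. As far as I know, `TwitterUtils.createStream(ssc, auth)` called without filter terms connects to the sample stream, so the hashtags would have to be passed as the `filters` argument for Twitter to do the selection on its side.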