Hi,

I found out what the problem is. The Twitter public stream is limited to
1% of all tweets (https://goo.gl/kDwnyS), which is why I can't get every
tweet posted with a specific hashtag using the approach from my previous
email, so that approach will not work for me. The other problem is that
filtering is limited to 400 hashtags (https://goo.gl/BywrAk), so to follow
more than 400 hashtags I need several parallel streams.

This brings me back to my previous question (https://goo.gl/bVDkHx). In my
application I need to follow more than 400 hashtags, and I need to collect
every tweet containing any of these hashtags. Another complication is that
users can add new hashtags or remove old ones, so I have to update the
streams in real time.
My earlier approach, without Apache Spark, was to create a twitter4j user
stream with an initial filter and, each time a new hashtag had to be added,
stop the stream, add the hashtag, and start it again. Once a stream had 400
hashtags, I initialized a new stream with new credentials. This was really
complex, and I was hoping that Apache Spark would make it simpler. However,
I have been trying for days to find a solution, with no success.
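
For reference, the bookkeeping in that twitter4j setup amounted to roughly
the following. This is a minimal sketch rather than the original code:
HashtagPlanner, plan, and streamsToRestart are hypothetical names, the limit
of 400 is the figure quoted above, and no Twitter or Spark calls are made.

```scala
// Hypothetical sketch of the bookkeeping only; it opens no streams.
object HashtagPlanner {
  // Maximum number of hashtags per filtered stream (figure quoted above).
  val MaxPerStream = 400

  // Split the hashtags, in insertion order, into one group per stream.
  // Group i would be served by the i-th set of credentials.
  def plan(hashtags: Seq[String],
           maxPerStream: Int = MaxPerStream): Vector[Vector[String]] =
    hashtags.map(_.toLowerCase).distinct
      .grouped(maxPerStream).map(_.toVector).toVector

  // Indices of the streams whose hashtag group differs between two plans,
  // i.e. the streams that would have to be stopped and started again.
  def streamsToRestart(before: Vector[Vector[String]],
                       after: Vector[Vector[String]]): Seq[Int] =
    (0 until math.max(before.size, after.size)).filter { i =>
      before.lift(i) != after.lift(i)
    }
}
```

Because the groups keep insertion order, appending a hashtag only touches
the last group or opens a new one. Removing a hashtag from an early group
shifts every later group, which is part of what made this approach complex.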

If I have to use the same approach I used with twitter4j, I have to solve 2
problems:
- how to run multiple Twitter streams in the same Spark context
- how to add new hashtags to an existing filter
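
A rough sketch of how the first point might look, assuming that
TwitterUtils.createStream is given a filters argument and that the resulting
streams are merged with StreamingContext.union. The Credentials case class,
MultiStream, and unionOfFilteredStreams are hypothetical names, and this is
untested against Twitter's connection limits. It does not address the second
point, because the filters are fixed when each receiver is created.

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.twitter.TwitterUtils
import twitter4j.Status
import twitter4j.auth.{Authorization, OAuthAuthorization}
import twitter4j.conf.ConfigurationBuilder

// Hypothetical holder for one set of Twitter application credentials.
case class Credentials(consumerKey: String, consumerSecret: String,
                       accessToken: String, accessTokenSecret: String)

object MultiStream {
  // Build a twitter4j authorization object from one set of credentials.
  def authFor(c: Credentials): Authorization =
    new OAuthAuthorization(
      new ConfigurationBuilder()
        .setOAuthConsumerKey(c.consumerKey)
        .setOAuthConsumerSecret(c.consumerSecret)
        .setOAuthAccessToken(c.accessToken)
        .setOAuthAccessTokenSecret(c.accessTokenSecret)
        .build())

  // One filtered receiver per group of at most maxPerStream hashtags,
  // each with its own credentials, merged into a single DStream.
  def unionOfFilteredStreams(ssc: StreamingContext,
                             hashtags: Seq[String],
                             credentials: Seq[Credentials],
                             maxPerStream: Int = 400): DStream[Status] = {
    val groups = hashtags.distinct.grouped(maxPerStream).toSeq
    require(groups.nonEmpty, "at least one hashtag is required")
    require(groups.size <= credentials.size,
      "not enough credentials for the number of hashtag groups")
    val streams = groups.zip(credentials).map { case (group, cred) =>
      TwitterUtils.createStream(ssc, Some(authFor(cred)), group)
    }
    ssc.union[Status](streams)
  }
}
```

Each receiver occupies one core, so the application would need more cores
than there are streams for the batches to be processed at all.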

I hope somebody has a more elegant solution or idea, or can tell me that I
missed something obvious.

Thanks,
Zoran

On Sat, Jul 25, 2015 at 8:44 PM, Zoran Jeremic <zoran.jere...@gmail.com>
wrote:

> Hi,
>
> I've implemented Twitter streaming as in the code given at the bottom of
> this email. It finds some tweets based on the hashtags I'm following.
> However, it seems that a large number of tweets are missing. I posted some
> tweets with hashtags that I'm following in the application, and none of
> them was received by the application. I also checked some hashtags (e.g.
> #android) on Twitter using Live, and I could see that something was posted
> with that hashtag almost every second, while my application received only
> 3-4 posts per minute.
>
> I didn't have this problem in the earlier non-Spark version of the
> application, which used twitter4j to access the user stream API. I guess
> this is some trending stream, but I couldn't find anything that explains
> which Twitter API Spark Twitter Streaming uses or how to create a stream
> that receives everything posted on Twitter.
>
> I hope somebody can explain what the problem is and how to solve it.
>
> Thanks,
> Zoran
>
>
>> import org.apache.spark.streaming.dstream.DStream
>> import org.apache.spark.streaming.twitter.TwitterUtils
>> import twitter4j.Status
>>
>> def initializeStreaming(): Unit = {
>>   val config = getTwitterConfigurationBuilder.build()
>>   val auth: Option[twitter4j.auth.Authorization] =
>>     Some(new twitter4j.auth.OAuthAuthorization(config))
>>   // No filters are passed here, so this reads the unfiltered stream.
>>   val stream: DStream[Status] = TwitterUtils.createStream(ssc, auth)
>>   // Keep only statuses whose text contains one of the followed hashtags.
>>   val filtered_statuses = stream.transform(rdd =>
>>     rdd.filter { status =>
>>       val text = status.getText.toLowerCase
>>       hashTagsList.exists(tag => text.contains(tag))
>>     })
>>   // collect() brings every matching status to the driver for printing.
>>   filtered_statuses.foreachRDD(rdd => rdd.collect().foreach(t => println(t)))
>>   ssc.start()
>> }
>>
>
