Actually, I posted that question :)
I already implemented the solution that Akhil suggested there, but it
uses the sample tweets API, which returns only 1% of the tweets.
That does not work for my use case. For the hashtags I'm interested
in, I need to catch every single tweet, not just some of them.
So only the Twitter filter API would work for me, but as I already wrote,
there is another problem: Twitter limits a filter to a maximum of 400
hashtags. That means I need several parallel Twitter streams in order to
follow more hashtags.
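
To make the splitting concrete, here is a minimal sketch in plain Scala
(no Spark dependency). `HashtagBatcher`, `MaxTagsPerStream` and `batches`
are names I made up for illustration, and 400 is the limit mentioned
above. Each resulting group would presumably become the `filters` argument
of one `TwitterUtils.createStream(ssc, auth, filters)` call; whether several
such streams can run side by side is exactly the part I have not solved.

```scala
object HashtagBatcher {
  // Assumed per-stream limit on hashtags, taken from the limit described above.
  val MaxTagsPerStream = 400

  // Normalize and deduplicate the hashtags, then split them into groups
  // that each fit into a single filtered stream.
  def batches(tags: Seq[String], limit: Int = MaxTagsPerStream): List[Seq[String]] =
    tags.map(_.trim.toLowerCase).filter(_.nonEmpty).distinct.grouped(limit).toList
}
```

With 950 distinct hashtags this yields three groups, so three streams
would be needed.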
That is the problem I could not solve with Spark Twitter streaming: I
could not start parallel streams. The other problem is that I need to add
and remove hashtags from the running streams, that is, I need to shut down
a stream and initialize its filter again. I managed to implement this with
twitter4j directly, but not with Spark Twitter streaming.
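
The add/remove step comes down to comparing the hashtags a stream is
currently running with against the set users now want, and restarting
only when the two differ. A minimal sketch of that comparison in plain
Scala follows; `FilterDiff`, `Change` and `needsRestart` are hypothetical
names, and the actual stop and re-creation of the stream is not shown.

```scala
object FilterDiff {
  // Hashtags to add to and remove from a running filter.
  final case class Change(added: Set[String], removed: Set[String]) {
    // A stream only has to be re-initialized when its filter actually changes.
    def needsRestart: Boolean = added.nonEmpty || removed.nonEmpty
  }

  // Compare the hashtags a stream runs with against the requested ones.
  def diff(running: Set[String], requested: Set[String]): Change =
    Change(requested -- running, running -- requested)
}
```

Keeping the comparison separate from the stream handling means a stream
whose group of hashtags did not change can be left running.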

Zoran



On Wed, Jul 29, 2015 at 10:40 AM, Peyman Mohajerian <mohaj...@gmail.com>
wrote:

> 'How to restart Twitter spark stream'
> It may not be exactly what you are looking for, but I thought it did touch
> on some aspect of your question.
>
> On Wed, Jul 29, 2015 at 10:26 AM, Zoran Jeremic <zoran.jere...@gmail.com>
> wrote:
>
>> Can you send me the subject of that email? I can't find any email
>> suggesting a solution to that problem. There is an email, "*Twitter4j
>> streaming question*", but it doesn't have any sample code. It just
>> confirms what I explained earlier: without filtering, Twitter limits you
>> to 1% of tweets, and if you use the filter API, Twitter limits you to 400
>> hashtags you can follow.
>>
>> Thanks,
>> Zoran
>>
>> On Wed, Jul 29, 2015 at 8:40 AM, Peyman Mohajerian <mohaj...@gmail.com>
>> wrote:
>>
>>> This question was answered with sample code a couple of days ago; please
>>> look back.
>>>
>>> On Sat, Jul 25, 2015 at 11:43 PM, Zoran Jeremic <zoran.jere...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I found the problem. The Twitter public stream is limited
>>>> to 1% of all tweets (https://goo.gl/kDwnyS), which is why I can't
>>>> access all the tweets posted with a specific hashtag using the approach I
>>>> posted in my previous email, so I guess that approach will not work for me.
>>>> The other problem is that filtering has a limit of 400 hashtags
>>>> (https://goo.gl/BywrAk), so in order to follow more than 400 hashtags I
>>>> need more parallel streams.
>>>>
>>>> This brings me back to my previous question (https://goo.gl/bVDkHx).
>>>> In my application I need to follow more than 400 hashtags, and I need to
>>>> collect every tweet that has one of these hashtags. Another complication is
>>>> that users can add new hashtags or remove old ones, so I have to
>>>> update the stream in real time.
>>>> My earlier approach, without Apache Spark, was to create a twitter4j user
>>>> stream with an initial filter and, each time a new hashtag had to be added,
>>>> stop the stream, add the hashtag, and start it again. When a stream reached
>>>> 400 hashtags, I initialized a new stream with new credentials. This was
>>>> really complex, and I was hoping that Apache Spark would make it simpler.
>>>> However, I have been trying for days to find a solution, without success.
>>>>
>>>> If I have to use the same approach I used with twitter4j, I have to
>>>> solve two problems:
>>>> - how to run multiple Twitter streams in the same Spark context
>>>> - how to add new hashtags to the existing filter
>>>>
>>>> I hope somebody has a more elegant solution or idea, or can
>>>> tell me that I missed something obvious.
>>>>
>>>> Thanks,
>>>> Zoran
>>>>
>>>> On Sat, Jul 25, 2015 at 8:44 PM, Zoran Jeremic <zoran.jere...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've implemented Twitter streaming as in the code given at the bottom
>>>>> of this email. It finds some tweets based on the hashtags I'm following.
>>>>> However, it seems that a large number of tweets are missing. I tried
>>>>> posting some tweets with hashtags I'm following in the application, and
>>>>> none of them was received. I also checked some hashtags (e.g. #android)
>>>>> on Twitter using Live and could see that something was posted with that
>>>>> hashtag almost every second, while my application received only 3-4 posts
>>>>> per minute.
>>>>>
>>>>> I didn't have this problem in the earlier non-Spark version of the
>>>>> application, which used twitter4j to access the user stream API. I guess
>>>>> this is some trending stream, but I couldn't find anything that explains
>>>>> which Twitter API Spark Twitter Streaming uses or how to create a stream
>>>>> that accesses everything posted on Twitter.
>>>>>
>>>>> I hope somebody can explain what the problem is and how to solve
>>>>> it.
>>>>>
>>>>> Thanks,
>>>>> Zoran
>>>>>
>>>>>
>>>>>> import org.apache.spark.streaming.dstream.DStream
>>>>>> import org.apache.spark.streaming.twitter.TwitterUtils
>>>>>> import twitter4j.Status
>>>>>>
>>>>>> // ssc, hashTagsList and getTwitterConfigurationBuilder are defined elsewhere
>>>>>> def initializeStreaming(): Unit = {
>>>>>>   val config = getTwitterConfigurationBuilder.build()
>>>>>>   val auth: Option[twitter4j.auth.Authorization] =
>>>>>>     Some(new twitter4j.auth.OAuthAuthorization(config))
>>>>>>   // No filters are passed to createStream here
>>>>>>   val stream: DStream[Status] = TwitterUtils.createStream(ssc, auth)
>>>>>>   // Keep statuses whose lower-cased text contains any followed hashtag
>>>>>>   val filteredStatuses = stream.filter { status =>
>>>>>>     val text = status.getText.toLowerCase
>>>>>>     hashTagsList.exists(tag => text.contains(tag))
>>>>>>   }
>>>>>>   // Print every matching status on the driver
>>>>>>   filteredStatuses.foreachRDD(rdd => rdd.collect().foreach(println))
>>>>>>   ssc.start()
>>>>>> }
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>>
>> *******************************************************************************
>> Zoran Jeremic, PhD
>> Senior System Analyst & Programmer
>>
>> Athabasca University
>> Tel: +1 604 92 89 944
>> E-mail: zoran.jere...@gmail.com <zoran.jere...@va.mod.gov.rs>
>> Homepage:  http://zoranjeremic.org
>>
>> **********************************************************************************
>>
>
>

