If you start parallel Twitter streams, you will be in breach of their TOS. They allow a small number of parallel stream in practice, but if you do it on massive scale they'll ban you (I'm speaking from experience ;) ).
If you really need that level of data, you need to talk to a company called Gnip - AFAIK they are the sole reseller now. It's not cheap though. On Wed, Jul 29, 2015 at 7:02 PM, Zoran Jeremic <zoran.jere...@gmail.com> wrote: > Actually, I posted that question :) > I already implemented solution that Akhil suggested there , and that > solution is using Sample tweets API, which returns only 1% of the tweets. > It would not work in my scenario of use. For the hashtags I'm interested > in, I need to catch each single tweet, not only some of them. > So for me, only twitter filtering API would work, but as I already wrote, > there is another problem. Twitter limits to maximum number of 400 hashtags > you can use in the filter. That means I need several parallel twitter > streams in order to follow more hashtags. > That was the problem I could not solve with Spark twitter streaming. I > could not start parallel streams. The other problem is that I need to add > and remove hashtags from the running streams, that is, I need to clean up > stream, and initialize filter again. I managed to implement this with > twitter4j directly, but not with spark-twitter streaming. > > Zoran > > > > On Wed, Jul 29, 2015 at 10:40 AM, Peyman Mohajerian <mohaj...@gmail.com> > wrote: > >> 'How to restart Twitter spark stream' i >> It may not be exactly what you are looking for, but i thought it did >> touch on some aspect of your question. >> >> On Wed, Jul 29, 2015 at 10:26 AM, Zoran Jeremic <zoran.jere...@gmail.com> >> wrote: >> >>> Can you send me the subject of that email? I can't find any email >>> suggesting solution to that problem. There is email "*Twitter4j >>> streaming question*", but it doesn't have any sample code. It just >>> confirms what I explained earlier that without filtering Twitter will limit >>> to 1% of tweets, and if you use filter API, Twitter limits you to 400 >>> hashtags you can follow. >>> >>> Thanks, >>> Zoran >>> >>> On Wed, Jul 29, 2015 at 8:40 AM, Peyman Mohajerian <mohaj...@gmail.com> >>> wrote: >>> >>>> This question was answered with sample code a couple of days ago, >>>> please look back. >>>> >>>> On Sat, Jul 25, 2015 at 11:43 PM, Zoran Jeremic < >>>> zoran.jere...@gmail.com> wrote: >>>> >>>>> Hi, >>>>> >>>>> I discovered what is the problem here. Twitter public stream is >>>>> limited to 1% of overall tweets (https://goo.gl/kDwnyS), so that's >>>>> why I can't access all the tweets posted with specific hashtag using >>>>> approach that I posted in previous email, so I guess this approach would >>>>> not work for me. The other problem is that filtering has a limit of 400 >>>>> hashtags (https://goo.gl/BywrAk), so in order to follow more than 400 >>>>> hashtags I need more parallel streams. >>>>> >>>>> This brings me back to my previous question (https://goo.gl/bVDkHx). >>>>> In my application I need to follow more than 400 hashtags, and I need to >>>>> collect each tweet having one of these hashtags. Another complication is >>>>> that users could add new hashtags or remove old hashtags, so I have to >>>>> update stream in the real-time. >>>>> My earlier approach without Apache Spark was to create twitter4j user >>>>> stream with initial filter, and each time new hashtag has to be added, >>>>> stop >>>>> stream, add new hashtag and run it again. When stream had 400 hashtags, I >>>>> initialize new stream with new credentials. This was really complex, and I >>>>> was hopping that Apache Spark would make it simpler. However, I'm trying >>>>> for a days to find solution, and had no success. >>>>> >>>>> If I have to use the same approach I used with twitter4j, I have to >>>>> solve 2 problems: >>>>> - how to run multiple twitter streams in the same spark context >>>>> - how to add new hashtags to the existing filter >>>>> >>>>> I hope that somebody will have some more elegant solution and idea, >>>>> and tell me that I missed something obvious. >>>>> >>>>> Thanks, >>>>> Zoran >>>>> >>>>> On Sat, Jul 25, 2015 at 8:44 PM, Zoran Jeremic < >>>>> zoran.jere...@gmail.com> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> I've implemented Twitter streaming as in the code given at the bottom >>>>>> of email. It finds some tweets based on the hashtags I'm following. >>>>>> However, it seems that a large amount of tweets is missing. I've tried to >>>>>> post some tweets that I'm following in the application, and none of them >>>>>> was received in application. I also checked some hashtags (e.g. #android) >>>>>> on Twitter using Live and I could see that almost each second something >>>>>> was >>>>>> posted with that hashtag, and my application received only 3-4 posts in >>>>>> one >>>>>> minute. >>>>>> >>>>>> I didn't have this problem in earlier non-spark version of >>>>>> application which used twitter4j to access user stream API. I guess this >>>>>> is >>>>>> some trending stream, but I couldn't find anything that explains which >>>>>> Twitter API is used in Spark Twitter Streaming and how to create stream >>>>>> that will access everything posted on the Twitter. >>>>>> >>>>>> I hope somebody could explain what is the problem and how to solve >>>>>> this. >>>>>> >>>>>> Thanks, >>>>>> Zoran >>>>>> >>>>>> >>>>>> def initializeStreaming(){ >>>>>>> val config = getTwitterConfigurationBuilder.build() >>>>>>> val auth: Option[twitter4j.auth.Authorization] = Some(new >>>>>>> twitter4j.auth.OAuthAuthorization(config)) >>>>>>> val stream:DStream[Status] = TwitterUtils.createStream(ssc, >>>>>>> auth) >>>>>>> val filtered_statuses = stream.transform(rdd =>{ >>>>>>> val filtered = rdd.filter(status =>{ >>>>>>> var found = false >>>>>>> for(tag <- hashTagsList){ >>>>>>> if(status.getText.toLowerCase.contains(tag)) { >>>>>>> found = true >>>>>>> } >>>>>>> } >>>>>>> found >>>>>>> }) >>>>>>> filtered >>>>>>> }) >>>>>>> filtered_statuses.foreachRDD(rdd => { >>>>>>> rdd.collect.foreach(t => { >>>>>>> println(t) >>>>>>> }) >>>>>>> }) >>>>>>> ssc.start() >>>>>>> } >>>>>>> >>>>>> >>>>> >>>>> >>>>> >>>>> >>>> >>> >>> >>> -- >>> >>> ******************************************************************************* >>> Zoran Jeremic, PhD >>> Senior System Analyst & Programmer >>> >>> Athabasca University >>> Tel: +1 604 92 89 944 >>> E-mail: zoran.jere...@gmail.com <zoran.jere...@va.mod.gov.rs> >>> Homepage: http://zoranjeremic.org >>> >>> ********************************************************************************** >>> >> >> > > > -- > > ******************************************************************************* > Zoran Jeremic, PhD > Senior System Analyst & Programmer > > Athabasca University > Tel: +1 604 92 89 944 > E-mail: zoran.jere...@gmail.com <zoran.jere...@va.mod.gov.rs> > Homepage: http://zoranjeremic.org > > ********************************************************************************** >