[ https://issues.apache.org/jira/browse/SPARK-13009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128843#comment-15128843 ]
Andrew Davidson commented on SPARK-13009:
-----------------------------------------

Hi Sean,

I totally agree with you. The Twitter4J people asked me to file an RFE with Spark. I agree it is their problem; I am just looking for some sort of workaround. My downstream systems will not be able to process the data I am capturing. I guess in the short term I will create the wrapper object and modify the Spark Twitter source code.

Kind regards,
Andy

> spark-streaming-twitter_2.10 does not make it possible to access the raw twitter json
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-13009
>                 URL: https://issues.apache.org/jira/browse/SPARK-13009
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 1.6.0
>            Reporter: Andrew Davidson
>            Priority: Minor
>
> The streaming-twitter package makes it easy for Java programmers to work with
> Twitter. The implementation returns the Twitter data, which arrives in JSON
> format, as a twitter4j StatusJSONImpl object:
>
>     JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, twitterAuth);
>
> The Status class is different from the raw JSON, i.e. serializing the Status
> object will not produce the same JSON as the original. I have downstream
> systems that can only process raw tweets, not twitter4j Status objects.
>
> Here is my bug/RFE request made to Twitter4J <twitte...@googlegroups.com>.
> They asked that I create a Spark tracking issue.
>
> On Thursday, January 21, 2016 at 6:27:25 PM UTC, Andy Davidson wrote:
>
> Hi All
>
> Quick problem summary:
>
> My system uses the Status objects to do some analysis; however, I need to
> store the raw JSON. There are other systems that process that data that are
> not written in Java.
>
> Currently we are serializing the Status object. The resulting JSON is going
> to break downstream systems.
>
> I am using the Apache Spark Streaming spark-streaming-twitter_2.10 package:
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources
>
> Request For Enhancement:
>
> I imagine easy access to the raw JSON is a common requirement. Would it be
> possible to add a member function getRawJson() to StatusJSONImpl? By default
> the returned value would be null unless jsonStoreEnabled=True is set in the
> config.
>
> Alternative implementations:
>
> It should be possible to modify spark-streaming-twitter_2.10 to provide
> this support. The solution is not very clean:
>
> It would require Apache Spark to define its own Status POJO. The current
> StatusJSONImpl class is marked final.
>
> The wrapper is not going to work nicely with existing code.
>
> spark-streaming-twitter_2.10 does not expose all of the Twitter streaming
> API, so many developers are writing their own implementations of
> org.apache.spark.streaming.twitter.TwitterInputDStream. This makes maintenance
> difficult. It is not easy to know when the Spark implementation for Twitter
> has changed.
>
> Code listing for
> spark-1.6.0/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala
>
> private[streaming]
> class TwitterReceiver(
>     twitterAuth: Authorization,
>     filters: Seq[String],
>     storageLevel: StorageLevel
>   ) extends Receiver[Status](storageLevel) with Logging {
>
>   @volatile private var twitterStream: TwitterStream = _
>   @volatile private var stopped = false
>
>   def onStart() {
>     try {
>       val newTwitterStream = new TwitterStreamFactory().getInstance(twitterAuth)
>       newTwitterStream.addListener(new StatusListener {
>         def onStatus(status: Status): Unit = {
>           store(status)
>         }
>
> Ref:
> https://forum.processing.org/one/topic/saving-json-data-from-twitter4j.html
>
> What do people think?
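
The wrapper alternative mentioned above could look roughly like the following. This is only a sketch under assumptions: the class name StatusWithRawJson and its accessors are hypothetical and are not part of Spark or Twitter4J, and the type parameter S stands in for twitter4j.Status so that the example compiles without Twitter4J on the classpath. In a modified receiver, onStatus would presumably obtain the raw string on the listener thread (with jsonStoreEnabled set) and call store() with the wrapper instead of the bare Status.

```java
import java.io.Serializable;
import java.util.Objects;

// Hypothetical wrapper: pairs a parsed status object with the raw JSON it was built from.
// S stands in for twitter4j.Status (assumption: the real type is Serializable as well).
final class StatusWithRawJson<S extends Serializable> implements Serializable {
    private static final long serialVersionUID = 1L;

    private final S status;
    private final String rawJson; // null when no raw JSON was captured

    StatusWithRawJson(S status, String rawJson) {
        this.status = Objects.requireNonNull(status, "status");
        this.rawJson = rawJson;
    }

    S getStatus() {
        return status;
    }

    String getRawJson() {
        return rawJson;
    }

    boolean hasRawJson() {
        return rawJson != null;
    }

    public static void main(String[] args) {
        // A plain String plays the role of the parsed Status in this demo.
        StatusWithRawJson<String> tweet =
            new StatusWithRawJson<>("parsed status", "{\"created_at\":\"Thu Dec 10 19:27:44 +0000 2015\"}");
        // Downstream, non-Java systems would be handed the raw JSON, not a re-serialized object.
        System.out.println(tweet.hasRawJson() ? tweet.getRawJson() : "raw JSON not captured");
    }
}
```

The drawback raised in the request still applies: existing code that expects a stream of Status objects would have to unwrap every element.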
>
> Kind regards
> Andy
>
> From: <twit...@googlegroups.com> on behalf of Igor Brigadir <igor.b...@ucdconnect.ie>
> Reply-To: <twit...@googlegroups.com>
> Date: Tuesday, January 19, 2016 at 5:55 AM
> To: Twitter4J <twit...@googlegroups.com>
> Subject: Re: [Twitter4J] trouble writing unit test
>
> The main issue is that the JSON object is in the wrong format,
> e.g. "createdAt": 1449775664000 should be "created_at": "Thu Dec 10 19:27:44
> +0000 2015", ...
>
> It looks like the JSON you have was serialized from a Java Status object,
> which makes the JSON objects different from what you get from the API.
> TwitterObjectFactory expects JSON from Twitter (I haven't had any problems
> using TwitterObjectFactory instead of the deprecated DataObjectFactory).
>
> You could "fix" it by matching the keys and values you have with the correct
> Twitter API JSON. It should look like the example here:
> https://dev.twitter.com/rest/reference/get/statuses/show/%3Aid
>
> But it might be easier to download the tweets again, this time using
> TwitterObjectFactory.getRawJSON(status) to get the original JSON from the
> Twitter API, and save that for later. (You must have jsonStoreEnabled=True in
> your config, and call getRawJSON in the same thread as .showStatus() or
> lookup() or whatever you're using to load tweets.)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
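
To illustrate the mismatch in the "createdAt" example from the reply, here is a small sketch that turns the epoch-millisecond value produced by serializing a Status object back into the string form shown for "created_at". The class and method names are hypothetical, and the date pattern is an assumption inferred from the single example in the thread. It covers only this one field; every other key and value would need the same kind of mapping, which is why re-capturing the raw JSON is likely the easier route.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, not part of Twitter4J or Spark.
final class CreatedAtFormat {
    // Pattern inferred from the thread's example: "Thu Dec 10 19:27:44 +0000 2015".
    private static final DateTimeFormatter TWITTER_DATE =
        DateTimeFormatter.ofPattern("EEE MMM dd HH:mm:ss Z yyyy", Locale.ENGLISH);

    // Converts the serialized "createdAt" epoch milliseconds to the "created_at" string form (UTC).
    static String fromEpochMillis(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis)
            .atOffset(ZoneOffset.UTC)
            .format(TWITTER_DATE);
    }

    public static void main(String[] args) {
        // Value taken from the example in the thread.
        System.out.println(fromEpochMillis(1449775664000L)); // prints "Thu Dec 10 19:27:44 +0000 2015"
    }
}
```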