Good to know there is another request inside twitter4j. I think, given the popularity of twitter4j, if we can parse all the fields in list (1) into ADM, that will be good enough.
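For concreteness, a minimal sketch of what mapping the list-(1) Status fields into an ADM record could look like (standard twitter4j 4.x getters; the output field names, the ADM constructors, and the class name here are illustrative assumptions, not the actual TweetParser output):

    import twitter4j.GeoLocation;
    import twitter4j.Status;

    // Illustrative only: turns a few of the fields twitter4j keeps on a Status
    // into an ADM-style record string. A real parser would cover the full list
    // and build typed records instead of text.
    public final class StatusToAdm {

        public static String toAdm(Status s) {
            StringBuilder b = new StringBuilder("{ ");
            b.append("\"id\": int64(\"").append(s.getId()).append("\")");
            b.append(", \"text\": \"").append(escape(s.getText())).append("\"");
            b.append(", \"user\": { \"id\": int64(\"").append(s.getUser().getId())
             .append("\"), \"screen_name\": \"").append(escape(s.getUser().getScreenName()))
             .append("\" }");
            GeoLocation g = s.getGeoLocation();
            if (g != null) {  // geolocation is often absent on tweets
                b.append(", \"geo\": point(\"").append(g.getLatitude()).append(",")
                 .append(g.getLongitude()).append("\")");
            }
            return b.append(" }").toString();
        }

        // minimal escaping so quotes in the tweet text do not break the record
        private static String escape(String t) {
            return t.replace("\\", "\\\\").replace("\"", "\\\"");
        }
    }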
> On Feb 23, 2016, at 12:00 AM, abdullah alamoudi <[email protected]> wrote:
>
> Jianfeng,
> We are using the twitter4j API to get tweets as Status objects. I believe that twitter4j itself discards the original JSON when creating Status objects. They provide a method to get the full JSON:
>
>     String rawJSON = DataObjectFactory.getRawJSON(status);
>
> This method, however, sends another request to Twitter to get the original JSON.
> We have a few choices:
>
> 1. Be okay with what twitter4j keeps: {CreatedAt, Id, Text, Source, isTruncated, InReplyToStatusId, InReplyToUserId, InReplyToScreenName, GeoLocation, Place, isFavorited, isRetweeted, FavoriteCount, User, isRetweet, RetweetedStatus, Contributors, RetweetCount, isRetweetedByMe, CurrentUserRetweetId, PossiblySensitive, Lang, Scopes, WithheldInCountries}. However, this means that we will not get additional fields if the actual data structure changes. We can also turn a Status into a JSON object using the method above and then use our ADM parser to parse it.
>
> 2. Instead of relying on twitter4j, we should be able to get the JSON objects directly using HTTP requests to Twitter. This way always gives us the complete JSON object as it comes from twitter.com, and we will get new fields the moment they are added.
>
> I think either way should be fine, and I actually think that we should stick to twitter4j for now and still use a specialized tweet parser which simply transforms the object's fields into ADM fields, unless there is a strong need for fields that are not covered by the list in (1).
>
> My 2c,
> Abdullah.
>
>
> On Tue, Feb 23, 2016 at 3:46 AM, Jianfeng Jia <[email protected]> wrote:
>
>> Dear devs,
>>
>> TwitterFeedAdapter is nice, but the internal TweetParser has some limitations.
>> 1. We only pick a few JSON fields, e.g. the user, geolocation, and message fields. I need the place field, and there are other fields that other applications may also be interested in.
>> 2. The text fields always go through getNormalizedString(), which filters out non-ASCII characters; that is a big loss of information. Even English text contains emojis, which are not "normal".
>>
>> Obviously we could add the entire Twitter structure to this parser. I'm wondering, though, whether the current one-to-one mapping between Adapter and Parser is the best design. The Twitter data itself changes, and there are many other interesting open data sources, e.g. Instagram, Facebook, Weibo, Reddit, etc. Could we have a general approach for all of these data sources?
>>
>> I'm thinking of having field-level JSON-to-ADM parsers (int, double, string, binary, point, time, polygon, ...). Then, given a schema option through the Adapter, we could easily assemble the fields into one record. The schema option could be a field mapping between the original JSON id and the ADM type, e.g. { "id": int64, "user": { "userid": int64, ... } }. That way we would not have to write a specific parser for each data source.
>>
>> Another thought is to pass the JSON object through as is and rely on the user's UDF to parse the data. Even in that case, the user could selectively override the field parsers that differ from ours.
>>
>> Any thoughts?
>>
>>
>> Best,
>>
>> Jianfeng Jia
>> PhD Candidate of Computer Science
>> University of California, Irvine
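To make the schema-driven idea above concrete, here is a rough sketch, assuming the raw JSON has already been obtained (e.g. via getRawJSON) and parsed with Jackson into a Map; the class and method names, the schema representation, and the text-based ADM output are all illustrative assumptions, not an existing AsterixDB API:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.util.Map;

    // Illustrative only: walks a parsed tweet together with a schema that maps
    // JSON field names to ADM types (nested maps for nested records) and emits
    // an ADM record as text. A real implementation would use AsterixDB's typed
    // record builders instead of strings.
    public final class SchemaDrivenParser {

        private static final ObjectMapper MAPPER = new ObjectMapper();

        @SuppressWarnings("unchecked")
        public static String parse(String rawJson, Map<String, Object> schema) throws Exception {
            Map<String, Object> json = MAPPER.readValue(rawJson, Map.class);
            return convert(json, schema);
        }

        @SuppressWarnings("unchecked")
        private static String convert(Map<String, Object> json, Map<String, Object> schema) {
            StringBuilder b = new StringBuilder("{ ");
            boolean first = true;
            for (Map.Entry<String, Object> field : schema.entrySet()) {
                Object value = json.get(field.getKey());
                if (value == null) {
                    continue;                       // optional or missing field
                }
                if (!first) {
                    b.append(", ");
                }
                first = false;
                b.append("\"").append(field.getKey()).append("\": ");
                if (field.getValue() instanceof Map) {  // nested record, recurse
                    b.append(convert((Map<String, Object>) value,
                                     (Map<String, Object>) field.getValue()));
                } else {
                    b.append(convertScalar(value, (String) field.getValue()));
                }
            }
            return b.append(" }").toString();
        }

        // per-type converters; more (point, time, polygon, ...) would be added here
        private static String convertScalar(Object value, String admType) {
            switch (admType) {
                case "int64":  return "int64(\"" + value + "\")";
                case "double": return "double(\"" + value + "\")";
                case "string": return "\"" + value + "\"";
                default: throw new IllegalArgumentException("no converter for " + admType);
            }
        }
    }

A call would look something like parse(rawJSON, schema) with a schema such as { "id": "int64", "text": "string", "user": { "id": "int64" } }, so adding a new data source would only mean supplying a new schema rather than writing a new parser.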
