Good to know there is another request inside twitter4j. Given the
popularity of twitter4j, I think parsing all the fields in list 1 into ADM
will be good enough.
On Feb 23, 2016, at 12:00 AM, abdullah alamoudi <[email protected]>
wrote:
Jianfeng,
We are using the twitter4j API to get tweets as Status objects. I believe
that twitter4j itself discards the original JSON when creating Status
objects. They provide a method to get the full JSON:

String rawJSON = DataObjectFactory.getRawJSON(status);

This method, however, sends another request to Twitter to get the original
JSON.
We have a few choices:
1. Be okay with what twitter4j keeps: {CreatedAt, Id, Text, Source,
isTruncated, InReplyToStatusId, InReplyToUserId, InReplyToScreenName,
GeoLocation, Place, isFavorited, isRetweeted, FavoriteCount, User,
isRetweet, RetweetedStatus, Contributors, RetweetCount, isRetweetedByMe,
CurrentUserRetweetId, PossiblySensitive, Lang, Scopes,
WithheldInCountries}. However, this means we will not get additional
fields if the actual data structure changes. We can also convert a Status
into a JSON object using the method above and then use our ADM parser to
parse it.
2. Instead of relying on twitter4j, we should be able to get the JSON
objects directly using HTTP requests to Twitter. This way always gives us
the complete JSON object as it comes from twitter.com, and we will get new
fields the moment they are added.
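To make the second option concrete: it amounts to building the REST request ourselves. Below is a minimal stdlib-only sketch, assuming Twitter's v1.1 statuses/show endpoint (the endpoint path is my assumption; OAuth signing and the actual network call are omitted, so this only shows the URL construction):

```java
import java.net.URI;

public class RawTweetUrl {
    // Build the REST URL for a single status. The endpoint path is an
    // assumption based on Twitter's v1.1 API; OAuth signing is omitted.
    static URI statusUrl(long id) {
        return URI.create("https://api.twitter.com/1.1/statuses/show.json?id=" + id);
    }

    public static void main(String[] args) {
        // The response body for this URL would be the complete JSON object
        // as it comes from twitter.com, including any newly added fields.
        System.out.println(statusUrl(697803684L));
    }
}
```

The point is that the response body is untouched JSON, so nothing is lost between Twitter and our parser.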
I think either way should be fine. I actually think we should stick with
twitter4j for now and still use a specialized tweet parser that simply
transforms the object’s fields into ADM fields, unless there is a strong
need for fields that are not covered by the list in (1).
My 2c,
Abdullah.
On Tue, Feb 23, 2016 at 3:46 AM, Jianfeng Jia <[email protected]>
wrote:
Dear devs,
The TwitterFeedAdapter is nice, but the internal TweetParser has some
limitations.
1. We only pick a few JSON fields, e.g. the user, geolocation, and message
fields. I need the place field, and there are other fields that other
applications may be interested in as well.
2. The text fields always go through getNormalizedString(), which filters
out non-ASCII chars, a big loss of information. Even English text contains
emojis, which are not “normal”.
Of course we can add the entire Twitter structure into this parser. I’m
wondering, though, whether the current one-to-one mapping between Adapter
and Parser is the best design. The Twitter data itself changes, and there
are a lot of other interesting open data sources, e.g. Instagram,
Facebook, Weibo, Reddit. Could we have a general approach for all these
data sources?
I’m thinking of having some field-level JSON-to-ADM parsers (int, double,
string, binary, point, time, polygon, …). Then, given the schema option
through the Adapter, we can easily assemble the fields into one record.
The schema option could be a field mapping between the original JSON id
and the ADM type, e.g. { “id”: int64, “user”: { “userid”: int64, … } }.
This way we don’t have to write a specific parser for each data source.
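A toy sketch of what I have in mind, with made-up names (the real ADM type construction is left out, and nested schemas are not handled): a schema map decides which field-level parser handles each raw field.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class FieldParserSketch {
    // Field-level parsers keyed by ADM type name; names are illustrative.
    static final Map<String, Function<String, Object>> PARSERS = new LinkedHashMap<>();
    static {
        PARSERS.put("int64", Long::parseLong);
        PARSERS.put("double", Double::parseDouble);
        PARSERS.put("string", s -> s);
    }

    // Assemble one record given a schema option { field -> ADM type },
    // e.g. { "id" -> "int64", "text" -> "string" }, and raw field values.
    static Map<String, Object> assemble(Map<String, String> schema,
                                        Map<String, String> raw) {
        Map<String, Object> record = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : schema.entrySet()) {
            record.put(e.getKey(),
                       PARSERS.get(e.getValue()).apply(raw.get(e.getKey())));
        }
        return record;
    }

    public static void main(String[] args) {
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("id", "int64");
        schema.put("text", "string");
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("id", "697803684");
        raw.put("text", "hello");
        System.out.println(assemble(schema, raw));  // prints {id=697803684, text=hello}
    }
}
```

With something like this, supporting a new data source would mean writing a new schema option, not a new parser.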
Another thought is to just hand over the JSON object as-is and rely on the
user’s UDF to parse the data. Even in this case, the user can selectively
override the few field parsers that should behave differently from ours.
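The override idea in miniature (again, all names hypothetical): the user replaces one default field parser with their own, and the rest stay as ours.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class OverrideSketch {
    public static void main(String[] args) {
        Map<String, Function<String, Object>> parsers = new HashMap<>();
        // Our defaults: keep both fields as plain strings.
        parsers.put("text", s -> s);
        parsers.put("created_at", s -> s);

        // The user's override: parse created_at into epoch millis instead.
        parsers.put("created_at", s -> Instant.parse(s).toEpochMilli());

        System.out.println(parsers.get("created_at").apply("2016-02-23T00:00:00Z"));
        // prints 1456185600000

        // text still goes through our default parser, untouched.
        System.out.println(parsers.get("text").apply("hello"));
    }
}
```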
Any thoughts?
Best,
Jianfeng Jia
PhD Candidate of Computer Science
University of California, Irvine