Dear devs,
TwitterFeedAdapter is nice, but the internal TweetParser has some limitations.
1. We only pick a few JSON fields, e.g. the user, geolocation, and message
fields. I need the place field, and there are other fields that other
applications may be interested in as well.
2. The text fields always go through getNormalizedString(), which filters out
non-ASCII characters. That is a big loss of information: even English text
contains emojis, which are not “normal”.
Obviously we could add the entire Twitter structure to this parser, but I’m
wondering whether the current one-to-one mapping between Adapter and Parser is
the best design. The Twitter data format itself changes over time, and there
are many other interesting open data sources, e.g. Instagram, Facebook, Weibo,
Reddit, etc. Could we have a general approach that covers all of these?
I’m thinking of having field-level JSON-to-ADM parsers
(int, double, string, binary, point, time, polygon, …). Then, given a schema
option passed through the Adapter, we could easily assemble the fields into one
record. The schema option could be a mapping from the original JSON field names
to ADM types, e.g. { “id”: int64, “user”: { “userid”: int64, .. } }. That way
we wouldn’t have to write a specific parser for each data source.
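To make the idea concrete, here is a rough sketch of what I have in mind. All
the names (FIELD_PARSERS, assembleRecord, the string type tags) are
hypothetical, not existing AsterixDB APIs; it just shows how per-type field
parsers could be composed by a schema option supplied through the Adapter:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: one tiny parser per ADM scalar type, assembled into a
// record according to a schema mapping (jsonField -> admTypeName) that the
// Adapter would receive as an option.
public class FieldLevelParserSketch {

    // Field-level parsers keyed by ADM type name (illustrative subset).
    static final Map<String, Function<String, Object>> FIELD_PARSERS = new LinkedHashMap<>();
    static {
        FIELD_PARSERS.put("int64", Long::parseLong);
        FIELD_PARSERS.put("double", Double::parseDouble);
        FIELD_PARSERS.put("string", s -> s); // keep raw text, emojis and all
    }

    // For each schema entry, look up the field-level parser for the declared
    // ADM type and apply it to the raw JSON value; unmapped fields are skipped.
    static Map<String, Object> assembleRecord(Map<String, String> rawJsonFields,
                                              Map<String, String> schema) {
        Map<String, Object> record = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : schema.entrySet()) {
            String raw = rawJsonFields.get(e.getKey());
            if (raw != null) {
                record.put(e.getKey(), FIELD_PARSERS.get(e.getValue()).apply(raw));
            }
        }
        return record;
    }

    public static void main(String[] args) {
        Map<String, String> tweet = new LinkedHashMap<>();
        tweet.put("id", "123456789");
        tweet.put("message", "hello \uD83D\uDE00"); // emoji survives, no normalization
        tweet.put("retweets", "42");                // not in the schema, so skipped

        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("id", "int64");
        schema.put("message", "string");

        System.out.println(assembleRecord(tweet, schema));
    }
}
```

A nested schema (like the user sub-record above) would just recurse the same
way, with a record-level parser delegating to the field-level ones.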
Another thought is to pass the JSON object through as-is and rely on the
user’s UDF to parse the data. Even in that case, the user could selectively
override the few field parsers that differ from ours.
Any thoughts?
Best,
Jianfeng Jia
PhD Candidate of Computer Science
University of California, Irvine