Dear devs,
TwitterFeedAdapter is nice, but the internal TweetParser has some limitations.
1. We only pick a few JSON fields, e.g. the user, geolocation, and message
fields. I need the place field, and there are other fields that other
applications may be interested in as well.
2. The text fields always go through getNormalizedString(), which filters out
non-ASCII characters. That is a big loss of information: even English text
contains emojis, which are not “normal”.
Obviously we could add the entire Twitter structure to this parser, but I’m
wondering whether the current one-to-one mapping between Adapter and Parser is
the best design. The Twitter data format itself changes over time, and there
are many other interesting open data sources, e.g. Instagram, Facebook, Weibo,
Reddit, etc. Could we have a general approach that covers all of these?
I’m thinking of having field-level JSON-to-ADM parsers
(int, double, string, binary, point, time, polygon, …). Then, given a schema
option passed through the Adapter, we could easily assemble the fields into one
record. The schema option could be a mapping from the original JSON field names
to ADM types, e.g. { “id”: int64, “user”: { “userid”: int64, .. } }. That way
we wouldn’t have to write a specific parser for each data source.
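To make the idea concrete, here is a rough sketch of what I have in mind. All
the names (FIELD_PARSERS, assembleRecord, the string type tags) are
hypothetical, not existing AsterixDB APIs; it just shows how per-type field
parsers could be composed by a schema option supplied through the Adapter:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: one tiny parser per ADM scalar type, assembled into a
// record according to a schema mapping (jsonField -> admTypeName) that the
// Adapter would receive as an option.
public class FieldLevelParserSketch {

    // Field-level parsers keyed by ADM type name (illustrative subset).
    static final Map<String, Function<String, Object>> FIELD_PARSERS = new LinkedHashMap<>();
    static {
        FIELD_PARSERS.put("int64", Long::parseLong);
        FIELD_PARSERS.put("double", Double::parseDouble);
        FIELD_PARSERS.put("string", s -> s); // keep raw text, emojis and all
    }

    // For each schema entry, look up the field-level parser for the declared
    // ADM type and apply it to the raw JSON value; unmapped fields are skipped.
    static Map<String, Object> assembleRecord(Map<String, String> rawJsonFields,
                                              Map<String, String> schema) {
        Map<String, Object> record = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : schema.entrySet()) {
            String raw = rawJsonFields.get(e.getKey());
            if (raw != null) {
                record.put(e.getKey(), FIELD_PARSERS.get(e.getValue()).apply(raw));
            }
        }
        return record;
    }

    public static void main(String[] args) {
        Map<String, String> tweet = new LinkedHashMap<>();
        tweet.put("id", "123456789");
        tweet.put("message", "hello \uD83D\uDE00"); // emoji survives, no normalization
        tweet.put("retweets", "42");                // not in the schema, so skipped

        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("id", "int64");
        schema.put("message", "string");

        System.out.println(assembleRecord(tweet, schema));
    }
}
```

A nested schema (like the user sub-record above) would just recurse the same
way, with a record-level parser delegating to the field-level ones.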
Another thought is to pass the JSON object through as-is and rely on the
user’s UDF to parse the data. Even in that case, the user could selectively
override the few field parsers that differ from ours.
Any thoughts?
Best,
Jianfeng Jia
PhD Candidate of Computer Science
University of California, Irvine