Agreed: we should definitely not be pulling in a subset of fields at the entry point. That's what the UDF is for (it can trim, add, or convert fields). Why not have the out-of-the-box adaptor simply keep all of the fields in their incoming form? For extra credit, if the data is targeted at a dataset with "more schema" than the incoming wide-open records, we might want the ability to do field-level type conversions at the point of entry into the dataset by calling the appropriate constructors with the incoming string values.
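
To make the "extra credit" idea concrete, here is a minimal Python sketch, not AsterixDB code. It assumes the declared part of a dataset's type can be expressed as a map from field name to a constructor; `DECLARED` and `coerce_declared_fields` are hypothetical names invented for illustration. Declared fields that arrive as strings are converted, and every other field passes through untouched.

```python
from datetime import datetime

# Hypothetical declared part of a dataset's type: field name -> constructor.
DECLARED = {
    "id": int,
    "followers": int,
    "score": float,
    "created_at": datetime.fromisoformat,
}


def coerce_declared_fields(record, declared=DECLARED):
    """Return a copy of `record` with declared string fields converted.

    Fields not named in `declared` are kept in their incoming form,
    so nothing is dropped at the entry point.
    """
    out = dict(record)
    for name, construct in declared.items():
        value = out.get(name)
        if isinstance(value, str):
            out[name] = construct(value)
    return out


incoming = {
    "id": "42",
    "score": "0.5",
    "created_at": "2016-02-22T16:46:00",
    "place": {"name": "Irvine"},  # undeclared: passes through unchanged
    "text": "hello",
}
converted = coerce_declared_fields(incoming)
```

A declared field that is missing from the incoming record (here `followers`) is simply left absent; whether that should be an error would depend on the dataset's type.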

On 2/22/16 4:46 PM, Jianfeng Jia wrote:
Dear devs,

TwitterFeedAdapter is nice, but the internal TweetParser has some limitations.
1. We only pick a few JSON fields, e.g. the user, geolocation, and message
fields. I need the place field, and there are other fields that other
applications may also be interested in.
2. The text fields always call getNormalizedString() to filter out the 
non-ASCII chars, which is a big loss of information. Even English text 
contains emojis, which are not “normal”.

Obviously we could add the entire Twitter structure to this parser, but I’m 
wondering whether the current one-to-one mapping between Adapter and Parser 
is the best design. The Twitter data itself changes. There are also a lot of 
interesting open data sources, e.g. Instagram, Facebook, Weibo, Reddit ….  
Could we have a general approach for all these data sources?

I’m thinking of having some field-level JSON-to-ADM parsers 
(int, double, string, binary, point, time, polygon…). Then, given a schema 
option passed through the Adapter, we can easily assemble the fields into one 
record. The schema option could be a mapping from each original JSON field 
name to an ADM type, e.g. { “id”: int64, “user”: { “userid”: int64,..} }. 
That way, we don’t have to write a specific parser for each data source.
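
A rough Python sketch of the schema-driven idea, just to illustrate the shape of it. The parser table, the type names used as strings, and the `assemble` function are all hypothetical; real ADM types and the Adapter configuration would look different.

```python
import json

# Hypothetical field-level parsers, keyed by the type name used in the schema.
FIELD_PARSERS = {
    "int64": int,
    "double": float,
    "string": str,
    "point": lambda v: (float(v[0]), float(v[1])),
}


def assemble(value, schema, parsers=FIELD_PARSERS):
    """Build a record from parsed JSON by walking the schema.

    A dict in the schema describes a nested record; a string names the
    field-level parser to apply. Fields missing from the input are skipped.
    """
    if isinstance(schema, dict):
        return {
            key: assemble(value[key], sub, parsers)
            for key, sub in schema.items()
            if isinstance(value, dict) and value.get(key) is not None
        }
    return parsers[schema](value)


schema = {
    "id": "int64",
    "user": {"userid": "int64", "name": "string"},
    "coordinates": "point",
}
raw = json.dumps({
    "id": "123",
    "user": {"userid": 7, "name": "jj", "lang": "en"},
    "coordinates": [33.64, -117.84],
    "text": "not in the schema",
})
record = assemble(json.loads(raw), schema)
```

In this sketch, fields that the schema does not name (`text`, `user.lang`) are dropped; a variant for open records could copy them through unchanged instead.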

Another thought is to just pass the JSON object through as it is and rely on 
the user’s UDF to parse the data. Even in this case, the user can selectively 
override any field parsers that should behave differently from ours.
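
As a sketch of what "selectively override" could mean, assuming the adapter hands the UDF a raw JSON string: the default parser table and `make_udf` below are hypothetical, and the ASCII-only default merely stands in for the lossy normalization described in point 2 above rather than reproducing getNormalizedString().

```python
import json

# Hypothetical defaults an adapter might ship; the names are illustrative only.
DEFAULT_FIELD_PARSERS = {
    "id": int,
    # Lossy stand-in for an ASCII-only normalizer: drops non-ASCII characters.
    "text": lambda s: s.encode("ascii", "ignore").decode("ascii"),
}


def make_udf(overrides=None):
    """Return a UDF that parses raw JSON and applies per-field parsers.

    `overrides` replaces individual default parsers; fields without a
    parser are kept exactly as they arrived.
    """
    parsers = {**DEFAULT_FIELD_PARSERS, **(overrides or {})}

    def udf(raw_json):
        record = json.loads(raw_json)
        for name, parse in parsers.items():
            if name in record:
                record[name] = parse(record[name])
        return record

    return udf


raw = json.dumps({
    "id": "9",
    "text": "good morning \U0001F600",
    "place": {"name": "Irvine"},
})
default_udf = make_udf()
keep_unicode_udf = make_udf({"text": str})  # override only the text parser
```

With the default table the emoji is stripped from `text`; with the override it survives, and in both cases `place` is passed through untouched.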

Any thoughts?


Best,

Jianfeng Jia
PhD Candidate of Computer Science
University of California, Irvine


