>> Maybe something we'd need for extra credit would be - if the data is
>> targeted at a dataset with "more schema" than the incoming wide-open
>> records - the ability to do field-level type conversions at the point of
>> entry into a dataset by calling the appropriate constructors with the
>> incoming string values?
I guess we can have an enhanced version of the cast-record function to do
that? It already handles the combination of complex types, open/closed-ness,
and type promotions. Maybe we can enhance it with temporal/spatial
constructors? (A sketch of such an entry-point cast follows the thread
below.)

Best,
Yingyi

On Mon, Feb 22, 2016 at 8:50 PM, Mike Carey <[email protected]> wrote:
> We should definitely not be pulling in a subset of fields at the entry
> point - that's what the UDF is for (it can trim off, add, or convert
> fields) - agreed. Why not have the out-of-the-box adaptor simply keep all
> of the fields in their incoming form? Maybe something we'd need for extra
> credit would be - if the data is targeted at a dataset with "more schema"
> than the incoming wide-open records - the ability to do field-level type
> conversions at the point of entry into a dataset by calling the
> appropriate constructors with the incoming string values?
>
>
> On 2/22/16 4:46 PM, Jianfeng Jia wrote:
>
>> Dear devs,
>>
>> TwitterFeedAdapter is nice, but the internal TweetParser has some
>> limitations.
>> 1. We only pick a few JSON fields, e.g. the user, geolocation, and
>> message fields. I need the place field, and there are other fields that
>> other applications may be interested in as well.
>> 2. The text fields always call getNormalizedString() to filter out
>> non-ASCII chars, which is a big loss of information. Even English text
>> contains emojis, which are not "normal".
>>
>> Apparently we could add the entire Twitter structure to this parser. I'm
>> wondering, though, whether the current one-to-one mapping between
>> Adapter and Parser is the best design. The Twitter data itself changes,
>> and there are many other interesting open data sources, e.g. Instagram,
>> Facebook, Weibo, Reddit... Could we have a general approach for all of
>> these?
>>
>> I'm thinking of having field-level JSON-to-ADM parsers (int, double,
>> string, binary, point, time, polygon...). Then, given a schema option
>> through the Adapter, we could easily assemble the fields into one
>> record. The schema option would be a field mapping between the original
>> JSON id and the ADM type, e.g. { "id": int64, "user": { "userid":
>> int64, ... } }. That way we wouldn't have to write a specific parser for
>> each data source. (See the first sketch after this thread.)
>>
>> Another thought is to just hand over the JSON object as-is and rely on
>> the user's UDF to parse the data. Even in that case, the user could
>> selectively override the field parsers that differ from ours.
>>
>> Any thoughts?
>>
>>
>> Best,
>>
>> Jianfeng Jia
>> PhD Candidate of Computer Science
>> University of California, Irvine
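A minimal sketch of the field-level parser registry idea in plain Java. All
names here (ParserRegistry, the type-name strings, the use of a Java array
for a point) are hypothetical illustrations, not actual AsterixDB classes;
the real implementation would map onto ADM's type system and builders.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class ParserRegistry {
    // Maps an ADM type name to a parser that turns the raw JSON string
    // value into a typed Java object (hypothetical representation).
    private final Map<String, Function<String, Object>> parsers = new HashMap<>();

    public ParserRegistry() {
        // Field-level parsers for primitive and temporal/spatial types.
        parsers.put("int64", Long::parseLong);
        parsers.put("double", Double::parseDouble);
        parsers.put("string", s -> s); // identity: no ASCII normalization
        parsers.put("datetime", java.time.LocalDateTime::parse);
        // e.g. a "point" constructor could split "x,y" into coordinates:
        parsers.put("point", s -> {
            String[] xy = s.split(",");
            return new double[] { Double.parseDouble(xy[0]),
                                  Double.parseDouble(xy[1]) };
        });
    }

    // Parse one field according to the type declared in the schema option,
    // e.g. { "id": int64, "user": { "userid": int64 } }.
    public Object parse(String admType, String rawValue) {
        Function<String, Object> p = parsers.get(admType);
        if (p == null) {
            throw new IllegalArgumentException("no parser for type " + admType);
        }
        return p.apply(rawValue);
    }
}
```

Because the registry is just a map, a user's UDF could override an entry for
one field (say, a custom date format) while inheriting all the others.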
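And a sketch of the "field-level type conversion at the point of entry"
idea: given the target dataset's declared field types and an incoming
wide-open record of raw string values, call the matching constructor for
each declared field and pass the rest through unchanged. It reuses the
hypothetical ParserRegistry above; castRecord here is only an analogy to
the cast-record function discussed, not its actual signature.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EntryPointCast {
    public static Map<String, Object> castRecord(Map<String, String> incoming,
                                                 Map<String, String> targetSchema,
                                                 ParserRegistry registry) {
        Map<String, Object> typed = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : incoming.entrySet()) {
            String declaredType = targetSchema.get(field.getKey());
            if (declaredType == null) {
                // Open field: keep it in its incoming (string) form.
                typed.put(field.getKey(), field.getValue());
            } else {
                // Declared field: convert via the type's constructor.
                typed.put(field.getKey(),
                          registry.parse(declaredType, field.getValue()));
            }
        }
        return typed;
    }
}
```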
