We should definitely not be pulling in a subset of fields at the entry
point - that's what the UDF is for (it can trim off, add, or convert
fields) - agreed. Why not have the out-of-the-box adaptor simply keep
all of the fields in their incoming form? For extra credit, maybe we'd
also need this: if the data is targeted at a dataset with "more schema"
than the incoming wide-open records, the ability to do field-level type
conversions at the point of entry into the dataset, by calling the
appropriate constructors with the incoming string values?
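
As a rough sketch of that extra-credit idea (illustrative only, not
AsterixDB code): fields that the target dataset declares are converted by
calling a constructor on the incoming string, and every other field is kept
in its incoming form. The names `convert_record` and `declared_types`, and
the sample field names, are hypothetical.

```python
from datetime import datetime

def convert_record(record, declared_types):
    """Convert declared fields via their constructors; keep the rest as-is."""
    out = {}
    for name, value in record.items():
        ctor = declared_types.get(name)
        # Undeclared ("open") fields pass through in their incoming form.
        out[name] = ctor(value) if ctor is not None else value
    return out

# Hypothetical declared part of the target dataset's schema.
declared_types = {
    "id": int,
    "created_at": datetime.fromisoformat,
}

incoming = {"id": "42", "created_at": "2016-02-22T16:46:00", "place": "Irvine"}
converted = convert_record(incoming, declared_types)
```

Here `id` and `created_at` come out typed, while `place` stays the string
that arrived.
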
On 2/22/16 4:46 PM, Jianfeng Jia wrote:
Dear devs,
TwitterFeedAdapter is nice, but the internal TweetParser has some limitations.
1. We only pick a few JSON fields, e.g. the user, geolocation, and message
fields. I need the place field. There are also other fields that other
applications may be interested in.
2. The text fields always call getNormalizedString() to filter out the
non-ASCII chars, which is a big loss of information. Even English text
contains emojis, which are not “normal”.
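
To illustrate the loss in point 2, here is a small sketch. It assumes the
normalization simply drops every non-ASCII character, as described above;
`drop_non_ascii` is a hypothetical stand-in, not the actual
getNormalizedString() implementation.

```python
def drop_non_ascii(text):
    """Hypothetical stand-in: keep only characters in the ASCII range."""
    return "".join(ch for ch in text if ord(ch) < 128)

tweet = "Great game tonight \U0001F600 caf\u00e9"
normalized = drop_non_ascii(tweet)

# The emoji and the accented letter are removed, and the original text
# cannot be recovered from the normalized string.
lost = len(tweet) - len(normalized)
```
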
Obviously we could add the entire Twitter structure to this parser, but I’m
wondering whether the current one-to-one mapping between Adapter and Parser
is the best design. The Twitter data format itself changes. There are also a
lot of interesting open data sources, e.g. Instagram, Facebook, Weibo, Reddit ….
Could we have a general approach for all of these data sources?
I’m thinking of having some field-level JSON-to-ADM parsers
(int, double, string, binary, point, time, polygon…). Then, given a schema option
passed through the Adapter, we can easily assemble the fields into one record.
The schema option could be a mapping from each original JSON field name to its
ADM type, e.g. { “id”: int64, “user”: { “userid”: int64, .. } }. That way, we
don’t have to write a specific parser for each data source.
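
A minimal sketch of this schema-driven approach, to make the idea concrete.
Everything here is an assumption for illustration: the `FIELD_PARSERS` table,
the `assemble` function, the type names used as strings, and the choice to
represent a point as a two-element array are hypothetical and not an existing
AsterixDB API.

```python
import json

# Hypothetical field-level parsers keyed by ADM type name.
FIELD_PARSERS = {
    "int64": int,
    "double": float,
    "string": str,
    # A point is assumed here to arrive as a two-element [x, y] array.
    "point": lambda v: (float(v[0]), float(v[1])),
}

def assemble(json_obj, schema, parsers=FIELD_PARSERS):
    """Build a record containing the fields named in the schema mapping."""
    record = {}
    for field, adm_type in schema.items():
        if field not in json_obj:
            continue  # Missing in the source: leave the field out.
        value = json_obj[field]
        if isinstance(adm_type, dict):
            # Nested mapping: assemble the sub-record recursively.
            record[field] = assemble(value, adm_type, parsers)
        else:
            record[field] = parsers[adm_type](value)
    return record

schema = {
    "id": "int64",
    "user": {"userid": "int64", "name": "string"},
    "coordinates": "point",
}
raw = json.loads(
    '{"id": 7, "user": {"userid": 99, "name": "jj", "lang": "en"},'
    ' "coordinates": [-117.8, 33.6], "text": "hi"}'
)
record = assemble(raw, schema)
```

In this sketch, fields that the schema does not mention (`lang`, `text`) are
dropped; passing them through unchanged would be an equally simple variant.
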
Another thought is to just pass the JSON object through as it is and rely on
the user’s UDF to parse the data. Even in this case, the user can selectively
override the field parsers whose behavior should differ from ours.
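
A sketch of this second option, again purely illustrative: the raw JSON is
handed over unchanged and a user function does the parsing, replacing only
the default field parsers it disagrees with. `DEFAULT_PARSERS`, `make_udf`,
and the field names are hypothetical.

```python
import json

# Hypothetical defaults the system would ship; "text" drops non-ASCII here.
DEFAULT_PARSERS = {
    "id": int,
    "text": lambda s: s.encode("ascii", "ignore").decode("ascii"),
}

def make_udf(overrides=None):
    """Return a UDF that applies the defaults plus the user's overrides."""
    parsers = {**DEFAULT_PARSERS, **(overrides or {})}

    def udf(raw_json):
        obj = json.loads(raw_json)
        # Fields without a parser are kept exactly as they arrived.
        return {k: parsers.get(k, lambda v: v)(v) for k, v in obj.items()}

    return udf

raw = '{"id": "5", "text": "ok \\ud83d\\ude00", "place": "Irvine"}'
default_result = make_udf()(raw)
custom_result = make_udf({"text": str})(raw)  # keep emojis in the text
```
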
Any thoughts?
Best,
Jianfeng Jia
PhD Candidate of Computer Science
University of California, Irvine