Re: Limitation of the current TweetParser

Mike Carey Mon, 22 Feb 2016 20:51:07 -0800

We should definitely not be pulling in a subset of fields at the entry point - that's what the UDF is for (it can trim off, add, or convert fields) - agreed. Why not have the out-of-the-box adaptor simply keep all of the fields in their incoming form? Maybe something we'd need for extra credit would be, if the data is targeted at a dataset with "more schema" than the incoming wide-open records, the ability to do field-level type conversions at the point of entry into a dataset by calling the appropriate constructors with the incoming string values?


On 2/22/16 4:46 PM, Jianfeng Jia wrote:

Dear devs,

TwitterFeedAdapter is nice, but the internal TweetParser has some limitations.
1. We only pick a few JSON fields, e.g. the user, geolocation, and message fields. I need
the place field. There are also other fields that other applications
may be interested in.
2. The text fields always call getNormalizedString() to filter out the
non-ASCII chars, which is a big loss of information. Even in English text
there are emojis, which are not “normal”.
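
To make the second point concrete, here is a minimal Python sketch of what dropping non-ASCII characters does to a tweet. The function strip_non_ascii is a hypothetical stand-in for the filtering described above, not the actual getNormalizedString() code, and the sample text is made up.

```python
def strip_non_ascii(text: str) -> str:
    # Hypothetical stand-in for the filtering described above:
    # keep only code points in the 7-bit ASCII range.
    return "".join(ch for ch in text if ord(ch) < 128)


# Made-up sample: \u2600 is a sun symbol, \u00fc is u-umlaut.
tweet_text = "Good morning \u2600 from Z\u00fcrich"
stripped = strip_non_ascii(tweet_text)

# Two characters are lost, and the place name is no longer spelled correctly.
print(repr(tweet_text))
print(repr(stripped))
```

Both the symbol and the accented letter disappear, so the stored text no longer matches what the user wrote.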

Obviously we could add the entire Twitter structure to this parser, but I’m
wondering whether the current one-to-one mapping between Adapter and Parser
is the best design. The Twitter data itself changes. There are also a lot of
interesting open data sources, e.g. Instagram, Facebook, Weibo, Reddit…
Could we have a general approach for all of these data sources?

I’m thinking of having some field-level JSON-to-ADM parsers
(int, double, string, binary, point, time, polygon…). Then, given a schema option
passed through the Adapter, we can easily assemble the fields into one record. The
schema option could be a mapping between the original JSON field names and ADM types, e.g.
{ “id”: int64, “user”: { “userid”: int64, .. } }. That way, we don’t have to write
a specific parser for each data source.
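
A rough Python sketch of this schema-driven idea, for illustration only. The converter table, the assemble function, the type names as strings, and the sample record are all assumptions made up for this example; they are not part of the existing Adapter or Parser code.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical field-level converters, keyed by a type name used in the schema.
CONVERTERS: Dict[str, Callable[[Any], Any]] = {
    "int64": int,
    "double": float,
    "string": str,
    "point": lambda v: (float(v[0]), float(v[1])),
}


def assemble(raw: Dict[str, Any], schema: Dict[str, Any]) -> Dict[str, Any]:
    """Build a record containing only the fields named in the schema."""
    record: Dict[str, Any] = {}
    for name, spec in schema.items():
        if name not in raw or raw[name] is None:
            continue  # a missing or null field is simply left out
        if isinstance(spec, dict):
            # A nested schema describes a nested record.
            record[name] = assemble(raw[name], spec)
        else:
            record[name] = CONVERTERS[spec](raw[name])
    return record


# Made-up schema option and input, in the spirit of the example above.
schema = {
    "id": "int64",
    "user": {"userid": "int64", "name": "string"},
    "coordinates": "point",
}
raw = json.loads(
    '{"id": "42", "user": {"userid": 7, "name": "jj", "lang": "en"},'
    ' "coordinates": [33.64, -117.84], "source": "web"}'
)
record = assemble(raw, schema)
print(record)
```

With this shape, supporting a new data source means supplying a new schema rather than writing a new parser. Fields that the schema does not mention (here "lang" and "source") are dropped.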

Another thought is to just pass the JSON object through as it is, and rely on the
user’s UDF to parse the data. Even in this case, the user can selectively
override the field parsers that should differ from ours.
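
The pass-through alternative could look roughly like this in Python. The name parse_passthrough, the overrides argument, and the sample line are hypothetical, chosen only to illustrate the idea.

```python
import json
from typing import Any, Callable, Dict, Optional


def parse_passthrough(
    line: str,
    overrides: Optional[Dict[str, Callable[[Any], Any]]] = None,
) -> Dict[str, Any]:
    """Keep every incoming field as-is, except where an override is given."""
    record = json.loads(line)
    for name, parser in (overrides or {}).items():
        if name in record:
            record[name] = parser(record[name])
    return record


# Made-up input line.
line = '{"id": "42", "text": "hi", "place": {"name": "Irvine"}}'

as_is = parse_passthrough(line)               # nothing trimmed or converted
typed = parse_passthrough(line, {"id": int})  # only "id" is converted
print(as_is)
print(typed)
```

Here nothing is lost at the entry point, and any trimming or further conversion is left to the user's UDF.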

Any thoughts?

Best,

Jianfeng Jia
PhD Candidate of Computer Science
University of California, Irvine
