>> Maybe something we'd need for extra credit would be - if the data is
>> targeted at a dataset with "more schema" than the incoming wide-open
>> records - the ability to do field-level type conversions at the point of
>> entry into a dataset by calling the appropriate constructors with the
>> incoming string values?
I guess we can have an enhanced version of the cast-record function to do
that? It already handles the combination of complex types, open/closed-ness,
and type promotions. Maybe we can enhance it with temporal/spatial
constructors? (A sketch of such an entry-point cast follows the thread
below.)

Best,
Yingyi

On Mon, Feb 22, 2016 at 8:50 PM, Mike Carey <[email protected]> wrote:
> We should definitely not be pulling in a subset of fields at the entry
> point - that's what the UDF is for (it can trim off, add, or convert
> fields) - agreed. Why not have the out-of-the-box adaptor simply keep all
> of the fields in their incoming form? Maybe something we'd need for extra
> credit would be - if the data is targeted at a dataset with "more schema"
> than the incoming wide-open records - the ability to do field-level type
> conversions at the point of entry into a dataset by calling the
> appropriate constructors with the incoming string values?
>
>
> On 2/22/16 4:46 PM, Jianfeng Jia wrote:
>
>> Dear devs,
>>
>> TwitterFeedAdapter is nice, but the internal TweetParser has some
>> limitations.
>> 1. We only pick a few JSON fields, e.g. the user, geolocation, and
>> message fields. I need the place field, and there are other fields that
>> other applications may be interested in as well.
>> 2. The text fields always call getNormalizedString() to filter out
>> non-ASCII chars, which is a big loss of information. Even English text
>> contains emojis, which are not "normal".
>>
>> Apparently we could add the entire Twitter structure to this parser. I'm
>> wondering, though, whether the current one-to-one mapping between
>> Adapter and Parser is the best design. The Twitter data itself changes,
>> and there are many other interesting open data sources, e.g. Instagram,
>> Facebook, Weibo, Reddit... Could we have a general approach for all of
>> these?
>>
>> I'm thinking of having field-level JSON-to-ADM parsers (int, double,
>> string, binary, point, time, polygon...). Then, given a schema option
>> through the Adapter, we could easily assemble the fields into one
>> record. The schema option would be a field mapping between the original
>> JSON id and the ADM type, e.g. { "id": int64, "user": { "userid":
>> int64, ... } }. That way we wouldn't have to write a specific parser for
>> each data source. (See the first sketch after this thread.)
>>
>> Another thought is to just hand over the JSON object as-is and rely on
>> the user's UDF to parse the data. Even in that case, the user could
>> selectively override the field parsers that differ from ours.
>>
>> Any thoughts?
>>
>>
>> Best,
>>
>> Jianfeng Jia
>> PhD Candidate of Computer Science
>> University of California, Irvine
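A minimal sketch of the field-level parser registry idea in plain Java. All
names here (ParserRegistry, the type-name strings, the use of a Java array
for a point) are hypothetical illustrations, not actual AsterixDB classes;
the real implementation would map onto ADM's type system and builders.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class ParserRegistry {
    // Maps an ADM type name to a parser that turns the raw JSON string
    // value into a typed Java object (hypothetical representation).
    private final Map<String, Function<String, Object>> parsers = new HashMap<>();

    public ParserRegistry() {
        // Field-level parsers for primitive and temporal/spatial types.
        parsers.put("int64", Long::parseLong);
        parsers.put("double", Double::parseDouble);
        parsers.put("string", s -> s); // identity: no ASCII normalization
        parsers.put("datetime", java.time.LocalDateTime::parse);
        // e.g. a "point" constructor could split "x,y" into coordinates:
        parsers.put("point", s -> {
            String[] xy = s.split(",");
            return new double[] { Double.parseDouble(xy[0]),
                                  Double.parseDouble(xy[1]) };
        });
    }

    // Parse one field according to the type declared in the schema option,
    // e.g. { "id": int64, "user": { "userid": int64 } }.
    public Object parse(String admType, String rawValue) {
        Function<String, Object> p = parsers.get(admType);
        if (p == null) {
            throw new IllegalArgumentException("no parser for type " + admType);
        }
        return p.apply(rawValue);
    }
}
```

Because the registry is just a map, a user's UDF could override an entry for
one field (say, a custom date format) while inheriting all the others.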
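And a sketch of the "field-level type conversion at the point of entry"
idea: given the target dataset's declared field types and an incoming
wide-open record of raw string values, call the matching constructor for
each declared field and pass the rest through unchanged. It reuses the
hypothetical ParserRegistry above; castRecord here is only an analogy to
the cast-record function discussed, not its actual signature.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EntryPointCast {
    public static Map<String, Object> castRecord(Map<String, String> incoming,
                                                 Map<String, String> targetSchema,
                                                 ParserRegistry registry) {
        Map<String, Object> typed = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : incoming.entrySet()) {
            String declaredType = targetSchema.get(field.getKey());
            if (declaredType == null) {
                // Open field: keep it in its incoming (string) form.
                typed.put(field.getKey(), field.getValue());
            } else {
                // Declared field: convert via the type's constructor.
                typed.put(field.getKey(),
                          registry.parse(declaredType, field.getValue()));
            }
        }
        return typed;
    }
}
```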
