I’ve created issue ASTERIXDB-1318 <https://issues.apache.org/jira/browse/ASTERIXDB-1318> about recovering the missing fields from the Twitter Stream JSON.
As for the cast-record, if we can add advanced type conversion, that would be great. (Rough sketches of both ideas are below my signature.)

> On Feb 22, 2016, at 10:06 PM, Yingyi Bu <[email protected]> wrote:
>
>> Maybe something we'd need for extra credit would be - if the data is
>> targeted at a dataset with "more schema" than the incoming wide-open
>> records - the ability to do field-level type conversions at the point of
>> entry into a dataset by calling the appropriate constructors with the
>> incoming string values?
>
> I guess we can have an enhanced version of the cast-record function to do
> that? It already considers the combination of complex types, open/closed
> types, and type promotions. Maybe we can enhance it with temporal/spatial
> constructors?
>
> Best,
> Yingyi
>
>
> On Mon, Feb 22, 2016 at 8:50 PM, Mike Carey <[email protected]> wrote:
>
>> We should definitely not be pulling in a subset of fields at the entry
>> point - that's what the UDF is for (it can trim off or add or convert
>> fields) - agreed. Why not have the out-of-the-box adapter simply keep all
>> of the fields in their incoming form? Maybe something we'd need for extra
>> credit would be - if the data is targeted at a dataset with "more schema"
>> than the incoming wide-open records - the ability to do field-level type
>> conversions at the point of entry into a dataset by calling the
>> appropriate constructors with the incoming string values?
>>
>>
>> On 2/22/16 4:46 PM, Jianfeng Jia wrote:
>>
>>> Dear devs,
>>>
>>> The TwitterFeedAdapter is nice, but the internal TweetParser has some
>>> limitations:
>>> 1. We only pick a few JSON fields, e.g. the user, geolocation, and
>>> message fields. I need the place field, and there are other fields that
>>> other applications may be interested in as well.
>>> 2. The text fields always go through getNormalizedString(), which
>>> filters out non-ASCII characters - a big loss of information. Even
>>> English text contains emojis, which are not "normal".
>>>
>>> We could certainly add the entire Twitter structure to this parser, but
>>> I'm wondering whether the current one-to-one mapping between Adapter and
>>> Parser is the best design. The Twitter data itself changes, and there
>>> are many other interesting open data sources, e.g. Instagram, Facebook,
>>> Weibo, Reddit, .... Could we have a general approach for all of these
>>> sources?
>>>
>>> One idea is to have field-level JSON-to-ADM parsers (int, double,
>>> string, binary, point, time, polygon, ...). Then, given a schema option
>>> passed through the Adapter, we could easily assemble the fields into one
>>> record. The schema option could be a field mapping between the original
>>> JSON field names and ADM types, e.g. { "id": int64, "user": { "userid":
>>> int64, ... } }. That way we would not have to write a specific parser
>>> for each data source.
>>>
>>> Another thought is to hand over the JSON object as-is and rely on the
>>> user's UDF to parse the data. Even in that case, the user could
>>> selectively override the field parsers that differ from ours.
>>>
>>> Any thoughts?
>>>
>>>
>>> Best,
>>>
>>> Jianfeng Jia
>>> PhD Candidate of Computer Science
>>> University of California, Irvine
>>>
>>>
>>

Best,

Jianfeng Jia
PhD Candidate of Computer Science
University of California, Irvine
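
P.S. To make the cast-record idea concrete, here is a minimal Java sketch of
constructor-driven field conversion at ingestion time. The names here
(AdmType, FieldConverter) are made up for illustration - they are not actual
AsterixDB classes - and the point/datetime parsing only gestures at what the
real ADM point() and datetime() constructors accept.

import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch only: the target dataset type tells us which
// constructor to call on each incoming string value.
public class FieldConverter {

    enum AdmType { INT64, DOUBLE, STRING, DATETIME, POINT }

    private static final Map<AdmType, Function<String, Object>> CONVERTERS = new HashMap<>();
    static {
        CONVERTERS.put(AdmType.INT64, Long::parseLong);
        CONVERTERS.put(AdmType.DOUBLE, Double::parseDouble);
        CONVERTERS.put(AdmType.STRING, s -> s);
        // Stand-ins for the ADM datetime() and point() constructors.
        CONVERTERS.put(AdmType.DATETIME, java.time.LocalDateTime::parse);
        CONVERTERS.put(AdmType.POINT, s -> {
            String[] xy = s.split(",");
            return new double[] { Double.parseDouble(xy[0].trim()),
                                  Double.parseDouble(xy[1].trim()) };
        });
    }

    // Convert one incoming string value to the type declared in the schema.
    public static Object convert(String raw, AdmType target) {
        Function<String, Object> constructor = CONVERTERS.get(target);
        if (constructor == null) {
            throw new IllegalArgumentException("No converter for " + target);
        }
        return constructor.apply(raw);
    }
}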

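And a sketch of the schema-driven assembly idea, using Jackson on the JSON
side. SchemaDrivenParser and the leaf type-name strings are hypothetical,
not AsterixDB's parser API; a schema value is either a type name (String)
or a nested Map describing a nested record.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaDrivenParser {

    private final ObjectMapper mapper = new ObjectMapper();

    public Map<String, Object> parse(String json, Map<String, Object> schema)
            throws Exception {
        return assemble(mapper.readTree(json), schema);
    }

    private Map<String, Object> assemble(JsonNode node, Map<String, Object> schema) {
        Map<String, Object> record = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : schema.entrySet()) {
            JsonNode child = node.get(entry.getKey());
            if (child == null) {
                continue; // field absent in this JSON object; skip it
            }
            if (entry.getValue() instanceof Map) {
                // Nested record: recurse with the nested schema.
                @SuppressWarnings("unchecked")
                Map<String, Object> nested = (Map<String, Object>) entry.getValue();
                record.put(entry.getKey(), assemble(child, nested));
            } else {
                // Leaf: dispatch to a field-level parser by type name.
                record.put(entry.getKey(), parseLeaf(child, (String) entry.getValue()));
            }
        }
        return record;
    }

    private Object parseLeaf(JsonNode node, String type) {
        switch (type) {
            case "int64":  return node.asLong();
            case "double": return node.asDouble();
            case "string": return node.asText(); // raw text, emojis included
            default: throw new IllegalArgumentException("Unknown type: " + type);
        }
    }
}

For example, parse(tweetJson, Map.of("id", "int64", "user", Map.of("userid",
"int64"))) would assemble a nested record with just the mapped fields; the
rest of the JSON could either be dropped or handed to the user's UDF as-is.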