I’ve created issue ASTERIXDB-1318 <https://issues.apache.org/jira/browse/ASTERIXDB-1318> about recovering the missing fields from the Twitter Stream JSON.
As for the cast-record, if we can add advanced type conversion, that would be great. (Rough sketches of both ideas are below my signature.)

> On Feb 22, 2016, at 10:06 PM, Yingyi Bu <[email protected]> wrote:
>
>> Maybe something we'd need for extra credit would be - if the data is
>> targeted at a dataset with "more schema" than the incoming wide-open
>> records - the ability to do field-level type conversions at the point of
>> entry into a dataset by calling the appropriate constructors with the
>> incoming string values?
>
> I guess we can have an enhanced version of the cast-record function to do
> that? It already considers the combination of complex types, open/closed
> types, and type promotions. Maybe we can enhance it with temporal/spatial
> constructors?
>
> Best,
> Yingyi
>
>
> On Mon, Feb 22, 2016 at 8:50 PM, Mike Carey <[email protected]> wrote:
>
>> We should definitely not be pulling in a subset of fields at the entry
>> point - that's what the UDF is for (it can trim off or add or convert
>> fields) - agreed. Why not have the out-of-the-box adapter simply keep all
>> of the fields in their incoming form? Maybe something we'd need for extra
>> credit would be - if the data is targeted at a dataset with "more schema"
>> than the incoming wide-open records - the ability to do field-level type
>> conversions at the point of entry into a dataset by calling the
>> appropriate constructors with the incoming string values?
>>
>>
>> On 2/22/16 4:46 PM, Jianfeng Jia wrote:
>>
>>> Dear devs,
>>>
>>> The TwitterFeedAdapter is nice, but the internal TweetParser has some
>>> limitations:
>>> 1. We only pick a few JSON fields, e.g. the user, geolocation, and
>>> message fields. I need the place field, and there are other fields that
>>> other applications may be interested in as well.
>>> 2. The text fields always go through getNormalizedString(), which
>>> filters out non-ASCII characters - a big loss of information. Even
>>> English text contains emojis, which are not "normal".
>>>
>>> We could certainly add the entire Twitter structure to this parser, but
>>> I'm wondering whether the current one-to-one mapping between Adapter and
>>> Parser is the best design. The Twitter data itself changes, and there
>>> are many other interesting open data sources, e.g. Instagram, Facebook,
>>> Weibo, Reddit, .... Could we have a general approach for all of these
>>> sources?
>>>
>>> One idea is to have field-level JSON-to-ADM parsers (int, double,
>>> string, binary, point, time, polygon, ...). Then, given a schema option
>>> passed through the Adapter, we could easily assemble the fields into one
>>> record. The schema option could be a field mapping between the original
>>> JSON field names and ADM types, e.g. { "id": int64, "user": { "userid":
>>> int64, ... } }. That way we would not have to write a specific parser
>>> for each data source.
>>>
>>> Another thought is to hand over the JSON object as-is and rely on the
>>> user's UDF to parse the data. Even in that case, the user could
>>> selectively override the field parsers that differ from ours.
>>>
>>> Any thoughts?
>>>
>>>
>>> Best,
>>>
>>> Jianfeng Jia
>>> PhD Candidate of Computer Science
>>> University of California, Irvine
>>>
>>>
>>

Best,

Jianfeng Jia
PhD Candidate of Computer Science
University of California, Irvine
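
P.S. To make the cast-record idea concrete, here is a minimal Java sketch of
constructor-driven field conversion at ingestion time. The names here
(AdmType, FieldConverter) are made up for illustration - they are not actual
AsterixDB classes - and the point/datetime parsing only gestures at what the
real ADM point() and datetime() constructors accept.

import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch only: the target dataset type tells us which
// constructor to call on each incoming string value.
public class FieldConverter {

    enum AdmType { INT64, DOUBLE, STRING, DATETIME, POINT }

    private static final Map<AdmType, Function<String, Object>> CONVERTERS = new HashMap<>();
    static {
        CONVERTERS.put(AdmType.INT64, Long::parseLong);
        CONVERTERS.put(AdmType.DOUBLE, Double::parseDouble);
        CONVERTERS.put(AdmType.STRING, s -> s);
        // Stand-ins for the ADM datetime() and point() constructors.
        CONVERTERS.put(AdmType.DATETIME, java.time.LocalDateTime::parse);
        CONVERTERS.put(AdmType.POINT, s -> {
            String[] xy = s.split(",");
            return new double[] { Double.parseDouble(xy[0].trim()),
                                  Double.parseDouble(xy[1].trim()) };
        });
    }

    // Convert one incoming string value to the type declared in the schema.
    public static Object convert(String raw, AdmType target) {
        Function<String, Object> constructor = CONVERTERS.get(target);
        if (constructor == null) {
            throw new IllegalArgumentException("No converter for " + target);
        }
        return constructor.apply(raw);
    }
}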

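And a sketch of the schema-driven assembly idea, using Jackson on the JSON
side. SchemaDrivenParser and the leaf type-name strings are hypothetical,
not AsterixDB's parser API; a schema value is either a type name (String)
or a nested Map describing a nested record.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaDrivenParser {

    private final ObjectMapper mapper = new ObjectMapper();

    public Map<String, Object> parse(String json, Map<String, Object> schema)
            throws Exception {
        return assemble(mapper.readTree(json), schema);
    }

    private Map<String, Object> assemble(JsonNode node, Map<String, Object> schema) {
        Map<String, Object> record = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : schema.entrySet()) {
            JsonNode child = node.get(entry.getKey());
            if (child == null) {
                continue; // field absent in this JSON object; skip it
            }
            if (entry.getValue() instanceof Map) {
                // Nested record: recurse with the nested schema.
                @SuppressWarnings("unchecked")
                Map<String, Object> nested = (Map<String, Object>) entry.getValue();
                record.put(entry.getKey(), assemble(child, nested));
            } else {
                // Leaf: dispatch to a field-level parser by type name.
                record.put(entry.getKey(), parseLeaf(child, (String) entry.getValue()));
            }
        }
        return record;
    }

    private Object parseLeaf(JsonNode node, String type) {
        switch (type) {
            case "int64":  return node.asLong();
            case "double": return node.asDouble();
            case "string": return node.asText(); // raw text, emojis included
            default: throw new IllegalArgumentException("Unknown type: " + type);
        }
    }
}

For example, parse(tweetJson, Map.of("id", "int64", "user", Map.of("userid",
"int64"))) would assemble a nested record with just the mapped fields; the
rest of the JSON could either be dropped or handed to the user's UDF as-is.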