Re: Questions of building record in AsterixDB

Yingyi Bu, Sun, 01 May 2016 14:39:06 -0700

>> - Using The SerializerDeserializer option, you will only create a single
>> object regardless of the number of parsed records, so I wouldn't worry
>> about it. Code maintainability takes precedence here IMO.


For scalar types, you're right.
But for complex types, there are objects created in each
serialize/deserialize call:

https://github.com/apache/incubator-asterixdb/blob/master/asterixdb/asterix-om/src/main/java/org/apache/asterix/dataflow/data/nontagged/serde/ARecordSerializerDeserializer.java#L176
https://github.com/apache/incubator-asterixdb/blob/master/asterixdb/asterix-om/src/main/java/org/apache/asterix/dataflow/data/nontagged/serde/AOrderedListSerializerDeserializer.java#L106
https://github.com/apache/incubator-asterixdb/blob/master/asterixdb/asterix-om/src/main/java/org/apache/asterix/dataflow/data/nontagged/serde/AUnorderedListSerializerDeserializer.java#L107
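
To illustrate the allocation concern in general terms, here is a minimal Java sketch. It is not the code at the links above: the names ListSerdeSketch, serializeStateless and ReusableListWriter are hypothetical, and the int-list encoding (a count followed by the items) is an assumption made up for the example. It only contrasts a per-call scratch buffer with one that is reset and reused.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical illustration only; not AsterixDB code.
final class ListSerdeSketch {

    // Stateless style: every call allocates a fresh scratch buffer and stream.
    static byte[] serializeStateless(int[] items) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        try {
            out.writeInt(items.length);
            for (int v : items) {
                out.writeInt(v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Stateful style: one scratch buffer per writer, reset and reused per list.
    static final class ReusableListWriter {
        private final ByteArrayOutputStream buf = new ByteArrayOutputStream();
        private final DataOutputStream out = new DataOutputStream(buf);

        void write(int[] items, OutputStream target) {
            buf.reset(); // keeps the backing array, drops the old contents
            try {
                out.writeInt(items.length);
                for (int v : items) {
                    out.writeInt(v);
                }
                buf.writeTo(target);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    public static void main(String[] args) {
        int[] list = {1, 2, 3};
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        ReusableListWriter writer = new ReusableListWriter();
        for (int i = 0; i < 1000; i++) {
            writer.write(list, sink); // the writer's scratch buffer is reused
        }
        System.out.println(sink.size() + " bytes written");
    }
}
```

Both styles produce the same bytes; the difference is only in how many short-lived objects each call creates, which is what matters when the call sits on the per-record path.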

>> - Right now, we parse missing values as null. Should that change?
I'll take care of that in my change:
-- if a missing field is in the closed part of the record type, it becomes null;
-- if a missing field is in the open part, it does not exist in the record.
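
A rough Java sketch of that rule, under stated assumptions: the record is modeled as a plain Map, the closed part of the type as a list of declared field names, and MissingFieldSketch and build are hypothetical names. None of this is the actual parser code, and nullability checks on closed fields are left out.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not AsterixDB code.
final class MissingFieldSketch {

    /**
     * Builds a record from the fields found in the input.
     *
     * @param closedFields field names declared in the closed part of the type
     * @param parsed       field name to value, holding only fields present in the input
     */
    static Map<String, Object> build(List<String> closedFields, Map<String, Object> parsed) {
        Map<String, Object> record = new LinkedHashMap<>();
        // Closed part: every declared field gets a slot; a missing one becomes null.
        for (String name : closedFields) {
            record.put(name, parsed.get(name)); // get() returns null when absent
        }
        // Open part: only fields actually present in the input are added,
        // so a missing open field simply does not exist in the record.
        for (Map.Entry<String, Object> e : parsed.entrySet()) {
            if (!closedFields.contains(e.getKey())) {
                record.put(e.getKey(), e.getValue());
            }
        }
        return record;
    }
}
```

For example, with closed fields "id" and "name" and an input that carries only "id" and an undeclared "extra", the result has "id", a null "name", and "extra", and nothing else.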

>> I am for NOT using the cast record solution for the overhead it will add.
>> but that is just me :)
Yes, there would be some overhead if we parse every record as a type-less
open record first.
But I think correctness is more important. I'm not sure whether the current
AdmParser can handle all general type-castable cases.

Best,
Yingyi


On Sat, Apr 30, 2016 at 6:10 PM, Xikui Wang <[email protected]> wrote:

> Hi Mike,
>
> Thanks for pointing that out. I think I misunderstood the working mechanism
> and misused the terms of 'dataset' and 'datatype'. Sorry about that.
>
> Best,
> Xikui
>
> On Sat, Apr 30, 2016 at 4:30 PM, Mike Carey <[email protected]> wrote:
>
> > One nit:  This has nothing to do with any dataset definition, on the
> > parser side of things - it's the type parameter on the create feed DDL
> > statement that should be the parser's guide.  (In general the optional
> > function on the feed may change the type by the time the data reaches a
> > dataset.)
> > On Apr 30, 2016 3:26 PM, "Xikui Wang" <[email protected]> wrote:
> >
> > > Hi Abdullah,
> > >
> > > Actually I also have the concern that adding a null check for general
> > > cases will bring extra overhead. Thus I plan to add the checking
> > > procedure after the parser, but before addTuple, i.e. in
> > > FeedRecordDataFlowController. But based on what I have seen so far, it
> > > seems RecordType is transparent to FeedRecordDataFlowController. So I am
> > > still investigating that...
> > >
> > > I saw the null check in the ADM parser. That's actually a viable way to
> > > handle that within the parser scope. But I am looking for a slightly
> > > different solution. From my perspective, the ADM parser assumes the
> > > input ADM should conform to the dataset definition. Thus it's reasonable
> > > for it to throw an exception. For Tweetparser, if I see a null value on
> > > a non-null attribute, I will discard the whole tweet directly, and may
> > > not even log it (as there are too many tweets with nulls). That's the
> > > reason why I want to put that in FeedRecordDataFlowController, since I
> > > don't see a good way to prevent a record insert in the parser other than
> > > throwing an exception.
> > >
> > > Not sure whether my opinion makes sense or not. Feel free to comment. :)
> > >
> > > Best,
> > > Xikui
> > >
> > > On Sat, Apr 30, 2016 at 1:52 PM, abdullah alamoudi <[email protected]> wrote:
> > >
> > > > Adding a few points here:
> > > >
> > > > My feeling is SerializerDeserializer offers another level of
> > > > abstraction but with output I can write value directly without
> > > > construct AType object.
> > > > I am wondering if there are any preferences over these two?
> > > >
> > > > - Using The SerializerDeserializer option, you will only create a
> > > > single object regardless of the number of parsed records, so I wouldn't
> > > > worry about it. Code maintainability takes precedence here IMO.
> > > > - In addition to records and lists, UTF8StringSerializerDeserializer
> > > > can be stateful for the same reason (avoid creating lots of unneeded
> > > > objects). In fact, our parsers use the stateful
> > > > UTF8StringSerializerDeserializer since I noticed that using the
> > > > stateless one creates lots of byte[] and triggers GC over and over.
> > > > - Right now, we parse missing values as null. Should that change?
> > > > - There is definitely a check for nulls on non-nullable values at
> > > > least in the ADM parser. There might be a bug however that makes it
> > > > accept explicit null values and that should be fixed.
> > > >
> > > > I am for NOT using the cast record solution for the overhead it will
> > > > add.
> > > > but that is just me :)
> > > > ~Abdullah.
> > > >
> > > >
> > > > On Sat, Apr 30, 2016 at 6:48 AM, Xikui Wang <[email protected]> wrote:
> > > >
> > > > > Thank you Yingyi. I will try to figure out a solution from that
> > > > > direction.
> > > > >
> > > > > Best,
> > > > > Xikui
> > > > >
> > > > > On Fri, Apr 29, 2016 at 3:48 PM, Yingyi Bu <[email protected]> wrote:
> > > > >
> > > > > > Yeah, I think so:-)
> > > > > >
> > > > > > Best,
> > > > > > Yingyi
> > > > > >
> > > > > > On Fri, Apr 29, 2016 at 3:46 PM, Mike Carey <[email protected]> wrote:
> > > > > >
> > > > > > > This indeed might be cleaner?
> > > > > > >
> > > > > > >
> > > > > > > On 4/29/16 3:28 PM, Yingyi Bu wrote:
> > > > > > >
> > > > > > > >>> I'm guessing that you can do similar things to
> > > > > > > >>> CastRecordDescriptor if you want to handle general cases in
> > > > > > > >>> that region.
> > > > > > > >>
> > > > > > > >> Or, you can inject a cast-record function in the loading
> > > > > > > >> pipeline so that you can defer the runtime-type-check/cast to
> > > > > > > >> that function instead of doing that in the parser.
> > > > > > >>
> > > > > > >>
> > > > > > > >> On Fri, Apr 29, 2016 at 3:25 PM, Yingyi Bu <[email protected]> wrote:
> > > > > > > >>
> > > > > > > >>> My answer is inlined.
> > > > > > > >>>
> > > > > > > >>>> My feeling is SerializerDeserializer offers another level of
> > > > > > > >>>> abstraction but with output I can write value directly
> > > > > > > >>>> without construct AType object.
> > > > > > > >>>> I am wondering if there are any preferences over these two?
> > > > > > > >>>
> > > > > > > >>> I agree with you. However, a SerializerDeserializer has to be
> > > > > > > >>> stateless, hence it cannot be used at runtime for complex type
> > > > > > > >>> objects such as records and lists, because it will create a
> > > > > > > >>> lot of Java objects.
> > > > > > > >>>
> > > > > > > >>>> in other words, parser has to guarantee that the
> > > > > > > >>>> processed records has to match the dataset
> > > > > > > >>>> definition(non-optional attribute cannot have null value). I
> > > > > > > >>>> tried to assign null value to non-null attributes. It will be
> > > > > > > >>>> inserted successfully but read records will have problem.
> > > > > > > >>>
> > > > > > > >>> That sounds right to me.  Please file a JIRA issue and assign
> > > > > > > >>> it to yourself (if you're working on that).
> > > > > > > >>> I'm guessing that you can do similar things to
> > > > > > > >>> CastRecordDescriptor if you want to handle general cases in
> > > > > > > >>> that region.
> > > > > > > >>>
> > > > > > > >>>> 3. Set to null or skip
> > > > > > > >>>> For optional(nullable) attributes, if I want to insert a
> > > > > > > >>>> record with null value on that attribute. Should I assign
> > > > > > > >>>> null value or should I just skip it? (Probably this is
> > > > > > > >>>> related to the missing attribute that Yingyi mentioned
> > > > > > > >>>> today?)
> > > > > > > >>>
> > > > > > > >>> Assign null value.
> > > > > > > >>> Missing means the field doesn't exist in a record at all.
> > > > > > > >>>
> > > > > > > >>> Best,
> > > > > > > >>> Yingyi
> > > > > > >>>
> > > > > > >>>
> > > > > > > >>> On Fri, Apr 29, 2016 at 2:06 PM, Xikui Wang <[email protected]> wrote:
> > > > > > > >>>
> > > > > > > >>>> Hi devs,
> > > > > > > >>>>
> > > > > > > >>>> I came across several questions while I was constructing
> > > > > > > >>>> records in AsterixDB.  Hope someone can help me clear the
> > > > > > > >>>> confusion. :)
> > > > > > > >>>>
> > > > > > > >>>> 1. Write directly to data output or use SerializerDeserializer
> > > > > > > >>>> I am working with AbstractDataParser now. I see people using
> > > > > > > >>>> different ways to append attributes to data output. Either
> > > > > > > >>>> use:
> > > > > > > >>>> output.write(typetag.serialize());
> > > > > > > >>>> output.writeInt(0);
> > > > > > > >>>> to write into data output directly, or
> > > > > > > >>>> use AInt8SerializerDeserializer.serialize(int8Serde) to
> > > > > > > >>>> serialize a AINT8 instance to output. *SerializerDeserializer
> > > > > > > >>>> uses writeByte to write output.
> > > > > > > >>>>
> > > > > > > >>>> My feeling is SerializerDeserializer offers another level of
> > > > > > > >>>> abstraction but with output I can write value directly
> > > > > > > >>>> without construct AType object.
> > > > > > > >>>> I am wondering if there are any preferences over these two?
> > > > > > > >>>>
> > > > > > > >>>> 2. RecordType validation after parser but before add to frame?
> > > > > > > >>>> My observation is after parser finish writing the output and
> > > > > > > >>>> pass it to next level, there is no such validation that
> > > > > > > >>>> checks whether non-optional field is null or not. In other
> > > > > > > >>>> words, parser has to guarantee that the processed records has
> > > > > > > >>>> to match the dataset definition(non-optional attribute cannot
> > > > > > > >>>> have null value). I tried to assign null value to non-null
> > > > > > > >>>> attributes. It will be inserted successfully but read records
> > > > > > > >>>> will have problem.
> > > > > > > >>>>
> > > > > > > >>>> 3. Set to null or skip
> > > > > > > >>>> For optional(nullable) attributes, if I want to insert a
> > > > > > > >>>> record with null value on that attribute. Should I assign
> > > > > > > >>>> null value or should I just skip it? (Probably this is
> > > > > > > >>>> related to the missing attribute that Yingyi mentioned
> > > > > > > >>>> today?)
> > > > > > > >>>>
> > > > > > > >>>> Thanks for your help.
> > > > > > > >>>>
> > > > > > > >>>> Best,
> > > > > > > >>>> Xikui
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
