Re: Questions of building record in AsterixDB

Mike Carey Mon, 02 May 2016 07:06:08 -0700

Agreed.  Let's maximize correctness and minimize code and therefore avoid  
additional room for problems and the need for corresponding tests and dual 
maintenance of principles.  We can then shoot to reduce object generation but 
still with two stages of types (initial and casted- and any improvements to 
casting efficiency, within casting code itself, will benefit all data adding 
code paths).  Then we can measure performance.  In the early days of Hyracks 
and AsterixDB, Vinayak was a stickler for trying to avoid premature 
optimizations that had software engineering and/or complexity impacts ... Let's 
try to keep that philosophy alive!


Sent from my iPad

On May 1, 2016, at 2:38 PM, Yingyi Bu <[email protected]> wrote:

>>> - Using The SerializerDeserializer option, you will only create a single
>>> object regardless of the number of parsed records, so I wouldn't worry
>>> about it. Code maintainability takes precedence here IMO.
> 
> For scalar types, you're right.
> But for complex types, there are objects created in each
> serialize/deserialize call:
> 
> https://github.com/apache/incubator-asterixdb/blob/master/asterixdb/asterix-om/src/main/java/org/apache/asterix/dataflow/data/nontagged/serde/ARecordSerializerDeserializer.java#L176
> https://github.com/apache/incubator-asterixdb/blob/master/asterixdb/asterix-om/src/main/java/org/apache/asterix/dataflow/data/nontagged/serde/AOrderedListSerializerDeserializer.java#L106
> https://github.com/apache/incubator-asterixdb/blob/master/asterixdb/asterix-om/src/main/java/org/apache/asterix/dataflow/data/nontagged/serde/AUnorderedListSerializerDeserializer.java#L107
> 
>>> - Right now, we parse missing values as null. Should that change?
> I'll take care of that in my change:
> -- if missing is in the closed part, it becomes null;
> -- if missing is in the open part,  it does not exist in the record.
> 
>>> I am for NOT using the cast record solution for the overhead it will add.
>>> but that is just me :)
> Yes, there would be some overhead if we parse every record as a type-less
> open record first.
> But I think correctness is more important. I'm not sure whether the current
> AdmParser can handle all general type-castable cases.
> 
> Best,
> Yingyi
> 
> 
>> On Sat, Apr 30, 2016 at 6:10 PM, Xikui Wang <[email protected]> wrote:
>> 
>> Hi Mike,
>> 
>> Thanks for pointing that out. I think I misunderstood the working mechanism
>> and misused the terms of 'dataset' and 'datatype'. Sorry about that.
>> 
>> Best,
>> Xikui
>> 
>>> On Sat, Apr 30, 2016 at 4:30 PM, Mike Carey <[email protected]> wrote:
>>> 
>>> One nit:  This has nothing to do with any dataset definition, on the
>> parser
>>> side of things - it's the type parameter on the create feed DDL statement
>>> that should be the parser's guide.  (In general the optional function on
>>> the feed may change the type by the time the data reaches a dataset.)
>>>> On Apr 30, 2016 3:26 PM, "Xikui Wang" <[email protected]> wrote:
>>>> 
>>>> Hi Abdullah,
>>>> 
>>>> Actually I also have the concern that adding null-check for general
>> cases
>>>> will bring extra
>>>> overheads. Thus I plan to add the checking procedure after parser, but
>>>> before addTuple,
>>>> i.e.FeedRecordDataFlowController. But based on what I have seen so far,
>>> it
>>>> seems RecordType
>>>> is transparent to FeedRecordDataFlowController. So I am still
>>> investigating
>>>> that...
>>>> 
>>>> I saw the null check in ADM parser. That's actually a viable way to
>>> handle
>>>> that within the
>>>> parser scope. But I am looking for a slightly different solution. In my
>>>> perspective,
>>>> ADM parser assumes the input adm should conform with the dataset
>>>> definition.
>>>> Thus it's reasonable for it to throw a exception. For Tweetparser, if I
>>> saw
>>>> null value on non-null attribute, I will
>>>> discard the whole tweet directly, and may not even log it(as too many
>>>> tweets with null).
>>>> That's the reason why I want to put that in
>> FeedRecordDataFlowController,
>>>> since I didn't see
>>>> there is a good way to prevent record insert in parser except for throw
>>>> exception.
>>>> 
>>>> Not sure my opinion makes sense or not. Feel free to comment. :)
>>>> 
>>>> Best,
>>>> Xikui
>>>> 
>>>> On Sat, Apr 30, 2016 at 1:52 PM, abdullah alamoudi <[email protected]
>>> 
>>>> wrote:
>>>> 
>>>>> Adding a few points here:
>>>>> 
>>>>> My feeling is SerializerDeserializer offers another level of
>>> abstraction
>>>>> but with output I can write value directly without construct AType
>>>> object.
>>>>> I am wondering if there are any preferences over these two?
>>>>> 
>>>>> - Using The SerializerDeserializer option, you will only create a
>>> single
>>>>> object regardless of the number of parsed records, so I wouldn't
>> worry
>>>>> about it. Code maintainability takes precedence here IMO.
>>>>> - In addition to records and lists, UTF8StringSerializerDeserializer
>>> can
>>>> be
>>>>> stateful for the same reason (avoid creating lost of un-needed
>>> objects).
>>>> In
>>>>> fact, our parsers use the stateful UTF8StringSerializerDeserializer
>>>> since I
>>>>> noticed that using the stateless one creates lots of byte[] and
>>> triggers
>>>> GC
>>>>> over and over.
>>>>> - Right now, we parse missing values as null. Should that change?
>>>>> - There is definitely a check for nulls on non-nullable values at
>> least
>>>> in
>>>>> the ADM parser. There might be a bug however that makes it accept
>>>> explicit
>>>>> null values and that should be fixed.
>>>>> 
>>>>> I am for NOT using the cast record solution for the overhead it will
>>> add.
>>>>> but that is just me :)
>>>>> ~Abdullah.
>>>>> 
>>>>> 
>>>>>> On Sat, Apr 30, 2016 at 6:48 AM, Xikui Wang <[email protected]> wrote:
>>>>>> 
>>>>>> Thank you Yingyi. I will try to figure out a solution from that
>>>>> direction.
>>>>>> 
>>>>>> Best,
>>>>>> Xikui
>>>>>> 
>>>>>> On Fri, Apr 29, 2016 at 3:48 PM, Yingyi Bu <[email protected]>
>>> wrote:
>>>>>> 
>>>>>>> Yeah, I think so:-)
>>>>>>> 
>>>>>>> Best,
>>>>>>> Yingyi
>>>>>>> 
>>>>>>> On Fri, Apr 29, 2016 at 3:46 PM, Mike Carey <[email protected]>
>>>> wrote:
>>>>>>> 
>>>>>>>> This indeed might be cleaner?
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 4/29/16 3:28 PM, Yingyi Bu wrote:
>>>>>>>>> 
>>>>>>>>> I'm guessing that you can do similar things to
>>>> CastRecordDescriptor
>>>>>>>>>>> if you want to handle general cases in that region.
>>>>>>>>>> Or, you can inject a cast-record function in the loading
>>> pipeline
>>>>>>>>> so that you can defer the runtime-type-check/cast to that
>>> function
>>>>>>> instead
>>>>>>>>> of doing that in the parser.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Fri, Apr 29, 2016 at 3:25 PM, Yingyi Bu <
>> [email protected]>
>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> My answer is inlined.
>>>>>>>>>> 
>>>>>>>>>> My feeling is SerializerDeserializer offers another level of
>>>>>>> abstraction
>>>>>>>>>>>> but with output I can write value directly without
>> construct
>>>>> AType
>>>>>>>>>>> object.
>>>>>>>>>> 
>>>>>>>>>>> I am wondering if there are any preferences over these two?
>>>>>>>>>>> I agree with you. However, a SerializerDeserializer has to
>> be
>>>>>>> stateless,
>>>>>>>>>> hence it cannot be used at runtime for complex type objects
>>> such
>>>> as
>>>>>>>>>> records and lists,
>>>>>>>>>> because it will create a lot Java objects.
>>>>>>>>>> 
>>>>>>>>>> in other words, parser has to guarantee that the
>>>>>>>>>>>> processed records has to match the dataset
>>>>> definition(non-optional
>>>>>>>>>>>> attribute cannot have null value). I tried to assign null
>>> value
>>>>> to
>>>>>>>>>>> non-null
>>>>>>>>>> 
>>>>>>>>>>> attributes. It will be inserted successfully but read
>> records
>>>> will
>>>>>>> have
>>>>>>>>>>>> problem.
>>>>>>>>>>> That sounds right to me.  Please file a JIRA issue and
>> assign
>>> to
>>>>>> you (
>>>>>>>>>> if you're working on that).
>>>>>>>>>> I'm guessing that you can do similar things to
>>>> CastRecordDescriptor
>>>>>>>>>> if you want to handle general cases in that region.
>>>>>>>>>> 
>>>>>>>>>> 3. Set to null or skip
>>>>>>>>>>>> For optional(nullable) attributes, if I want to insert a
>>> record
>>>>>> with
>>>>>>>>>>> null
>>>>>>>>>> 
>>>>>>>>>>> value on that attribute. Should I assign null value or
>> should
>>> I
>>>>> just
>>>>>>>>>>> skip
>>>>>>>>>> 
>>>>>>>>>>> it? (Probably this is related to the missing attribute that
>>>> Yingyi
>>>>>>>>>>>> mentioned today?)
>>>>>>>>>>> Assign null value.
>>>>>>>>>> Missing means the field doesn't exist in a record at all.
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Yingyi
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Fri, Apr 29, 2016 at 2:06 PM, Xikui Wang <[email protected]>
>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi devs,
>>>>>>>>>>> 
>>>>>>>>>>> I came across several questions while I was constructing
>>> records
>>>>> in
>>>>>>>>>>> AsterixDB.  Hope someone can help me clear the confusion. :)
>>>>>>>>>>> 
>>>>>>>>>>> 1. Write directly to data output or use
>> SerializerDeserializer
>>>>>>>>>>> I am working with AbstractDataParser now. I see people using
>>>>>> different
>>>>>>>>>>> ways
>>>>>>>>>>> to append attributes to data output. Either use:
>>>>>>>>>>> output.Write(typetag.serialize());
>>>>>>>>>>> output.WriteInt(0);
>>>>>>>>>>> to write into data output directly, or
>>>>>>>>>>> use AInt8SerializerDeserializer.serialize(int8Serde) to
>>>> serialize
>>>>> a
>>>>>>>>>>> AINT8
>>>>>>>>>>> instance to output. *SerializerDeserializer uses writeByte
>> to
>>>>> write
>>>>>>>>>>> output.
>>>>>>>>>>> 
>>>>>>>>>>> My feeling is SerializerDeserializer offers another level of
>>>>>>> abstraction
>>>>>>>>>>> but with output I can write value directly without construct
>>>> AType
>>>>>>>>>>> object.
>>>>>>>>>>> I am wondering if there are any preferences over these two?
>>>>>>>>>>> 
>>>>>>>>>>> 2. RecordType validation after parser but before add to
>> frame?
>>>>>>>>>>> My observation is after parser finish writing the output and
>>>> pass
>>>>> it
>>>>>>> to
>>>>>>>>>>> next level, there is no such validation that checks whether
>>>>>>> non-optional
>>>>>>>>>>> field is null or not. In other words, parser has to
>> guarantee
>>>> that
>>>>>> the
>>>>>>>>>>> processed records has to match the dataset
>>>> definition(non-optional
>>>>>>>>>>> attribute cannot have null value). I tried to assign null
>>> value
>>>> to
>>>>>>>>>>> non-null
>>>>>>>>>>> attributes. It will be inserted successfully but read
>> records
>>>> will
>>>>>>> have
>>>>>>>>>>> problem.
>>>>>>>>>>> 
>>>>>>>>>>> 3. Set to null or skip
>>>>>>>>>>> For optional(nullable) attributes, if I want to insert a
>>> record
>>>>> with
>>>>>>>>>>> null
>>>>>>>>>>> value on that attribute. Should I assign null value or
>> should
>>> I
>>>>> just
>>>>>>>>>>> skip
>>>>>>>>>>> it? (Probably this is related to the missing attribute that
>>>> Yingyi
>>>>>>>>>>> mentioned today?)
>>>>>>>>>>> 
>>>>>>>>>>> Thanks for your help.
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Xikui
>>

Re: Questions of building record in AsterixDB

Reply via email to