Re: Schema Discovery Support in Apex Applications

Thomas Weise Fri, 03 Feb 2017 07:59:31 -0800

Agreed. As noted the main concern was the ability to support idempotency.
It isn't really "re-ordering" because when you have multiple input ports,
there isn't any ordering guarantee within a streaming window.


The end window boundary is good when the control tuple needs to be
processed after all associated data tuples (which is the case for
watermarks).

For schema it is the opposite: the schema needs to be seen before all data
tuples. The scenario of multiple input ports needs to be considered here as
well.
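
To make the two delivery points concrete, here is a minimal sketch in plain
Java. It is not Apex code: Kind, Tuple and deliver are hypothetical names, and
handing the schema over at the begin window boundary is only one assumed way
of making sure it is seen before the data. Data tuples keep their arrival
order; only the control tuples are moved to a window boundary, so their
position relative to the data does not depend on how the ports happened to
interleave inside the window.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WindowDeliveryOrder {
    /** Hypothetical tuple kinds; SCHEMA and WATERMARK stand for custom control tuples. */
    enum Kind { SCHEMA, DATA, WATERMARK }

    /** One tuple received on some input port during a streaming window. */
    static final class Tuple {
        final Kind kind;
        final String port;
        final String payload;

        Tuple(Kind kind, String port, String payload) {
            this.kind = kind;
            this.port = port;
            this.payload = payload;
        }
    }

    /**
     * Returns the tuples of one window in delivery order: schema tuples first
     * (begin window boundary), data in arrival order, watermarks last
     * (end window boundary).
     */
    static List<String> deliver(List<Tuple> window) {
        List<String> delivered = new ArrayList<>();
        // Enum declaration order doubles as delivery order: SCHEMA, DATA, WATERMARK.
        for (Kind kind : Kind.values()) {
            for (Tuple t : window) {
                if (t.kind == kind) {
                    delivered.add(kind + "@" + t.port + ":" + t.payload);
                }
            }
        }
        return delivered;
    }

    public static void main(String[] args) {
        // Arrival order as it might be seen with two input ports (in1, in2).
        List<Tuple> window = Arrays.asList(
            new Tuple(Kind.DATA, "in1", "a"),
            new Tuple(Kind.WATERMARK, "in1", "t=10"),
            new Tuple(Kind.SCHEMA, "in2", "{id:long}"),
            new Tuple(Kind.DATA, "in2", "b"));
        deliver(window).forEach(System.out::println);
    }
}
```

The sketch ignores the harder case raised in this thread, namely a schema
that changes in the middle of a window.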

Thomas


On Thu, Feb 2, 2017 at 9:59 AM, Vlad Rozov <[email protected]> wrote:

> I second the proposal to revisit custom control tuple delivery and
> re-ordering. Schema support brings a use case that was missing when we
> discussed custom control tuples.
>
> Thank you,
>
> Vlad
>
>
> On 2/1/17 21:56, Pramod Immaneni wrote:
>
>> This can be done neatly and possibly completely outside the engine if we
>> are able to deliver schema information via the control tuple mechanism.
>> The current control tuple proposal reorders the control tuple so that it
>> is delivered to the operator at the end of the window. This would not be
>> feasible for schemas, as the schema would need to be delivered before the
>> data. If we can reconsider this behavior and not reorder the control
>> tuple, it would work for this use case. We can have further discussions
>> on the scenarios this raises, such as what to do when there are multiple
>> paths for data and how control tuples get delivered to unifiers, and look
>> into suggestions like synchronizing on control tuple boundaries and other
>> ways to solve these. What do you guys think?
>>
>> On Wed, Feb 1, 2017 at 8:27 PM, Thomas Weise <[email protected]> wrote:
>>
>>> I think dynamic schema would be good to consider (schema known and
>>> possibly changing at runtime). Some applications cannot be written under
>>> the assumption that the schema is known upfront.
>>>
>>> Also, does this really need to leak into the engine? I think it would be
>>> good to consider alternatives and tradeoffs.
>>>
>>> Thomas
>>>
>>>
>>> On Mon, Jan 30, 2017 at 10:44 PM, Chinmay Kolhatkar <[email protected]>
>>> wrote:
>>>
>>>> The consumer of the output port operator schema is going to be the next
>>>> downstream operator.
>>>>
>>>> On Tue, Jan 31, 2017 at 4:01 AM, Sergey Golovko <[email protected]>
>>>> wrote:
>>>>
>>>>> Sorry, I’m new to the APEX team, and I don't clearly understand who the
>>>>> consumers of the output port operator schema(s) are.
>>>>>
>>>>> 1. If the consumers are non-run-time callers like the application
>>>>> manager or UI designer, maybe it makes sense to use Java static
>>>>> method(s) to retrieve the output port operator schema(s). I guess the
>>>>> performance of a single call of a static method via reflection can be
>>>>> ignored.
>>>>>
>>>>> 2. If the consumer is the next downstream operator, maybe it makes sense
>>>>> to send an output port operator schema from the upstream operator to the
>>>>> next downstream operator via the stream. The corresponding methods that
>>>>> would send and receive the schema should be declared in the
>>>>> interface/abstract class of the upstream and downstream operators. The
>>>>> sending/receiving of an output schema should be processed right before
>>>>> the sending of the first data record via the stream.
>>>>>
>>>>> One example of a typical implementation for sending metadata with a
>>>>> regular result set is the sending of JDBC metadata as part of a JDBC
>>>>> result set. And I hope the output schema (metadata of the streamed data)
>>>>> in the implementation will contain not only a signature of the streamed
>>>>> objects (like field names and data types), but also any other properties
>>>>> of the data that the schema receiver can use to process the data (for
>>>>> instance, a delimiter for a CSV record stream).
>>>>>
>>>>> Thanks,
>>>>> Sergey
>>>>>
>>>>> On 2017-01-25 01:47 (-0800), Chinmay Kolhatkar <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Thank you all for the feedback.
>>>>>>
>>>>>> I've created a Jira for this: APEXCORE-623 and I'll attach the same
>>>>>> document and link to this mailchain there.
>>>>>>
>>>>>> As a first part of this Jira, there are 2 steps I would like to
>>>>>> propose:
>>>>>>
>>>>>> 1. Add the following interface at
>>>>>> com.datatorrent.common.util.SchemaAware.
>>>>>>
>>>>>> interface SchemaAware {
>>>>>>     Map<OutputPort, Schema> registerSchema(Map<InputPort, Schema> inputSchema);
>>>>>> }
>>>>>>
>>>>>> This interface can be implemented by operators to communicate their
>>>>>> output schema(s) to the engine.
>>>>>> The input to this method will be the schema at its input port(s).
>>>>>>
>>>>>> 2. After the LogicalPlan is created, call the SchemaAware method from
>>>>>> upstream to downstream operator in the DAG to propagate the schema.
>>>>>>
>>>>>> Once this is done, changes can be made in Malhar for the operators in
>>>>>> question.
>>>>>>
>>>>>> Please share your opinion on this approach.
>>>>>>
>>>>>> Thanks,
>>>>>> Chinmay.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Jan 18, 2017 at 2:31 PM, Priyanka Gugale <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> +1 to have this feature.
>>>>>>>
>>>>>>> -Priyanka
>>>>>>>
>>>>>>> On Tue, Jan 17, 2017 at 9:18 PM, Pramod Immaneni <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> On Mon, Jan 16, 2017 at 1:23 AM, Chinmay Kolhatkar <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> Currently, if a DAG generated by the user contains any POJOfied
>>>>>>>>> operators, the TUPLE_CLASS attribute needs to be set on each and
>>>>>>>>> every port which receives or sends a POJO.
>>>>>>>>>
>>>>>>>>> For example, if a DAG is like File -> Parser -> Transform -> Dedup ->
>>>>>>>>> Formatter -> Kafka, then the TUPLE_CLASS attribute needs to be set by
>>>>>>>>> the user on both input and output ports of the transform and dedup
>>>>>>>>> operators, and also on the parser output and formatter input.
>>>>>>>>>
>>>>>>>>> The proposal here is to reduce the work that is required by the user
>>>>>>>>> to configure the DAG. Technically speaking, if an operator knows its
>>>>>>>>> input schema and processing properties, it can determine the output
>>>>>>>>> schema and convey it to downstream operators. This way the complete
>>>>>>>>> pipeline can be configured without the user setting TUPLE_CLASS or
>>>>>>>>> even creating POJOs and adding them to the classpath.
>>>>>>>>>
>>>>>>>>> On the same idea, I want to propose an approach where the pipeline
>>>>>>>>> can be configured without the user setting TUPLE_CLASS or even
>>>>>>>>> creating POJOs and adding them to the classpath.
>>>>>>>>> Here is the document which explains the idea and a high level
>>>>>>>>> design:
>>>>>>>>> https://docs.google.com/document/d/1ibLQ1KYCLTeufG7dLoHyN_tRQXEM3LR-7o_S0z_porQ/edit?usp=sharing
>>>>>>>>>
>>>>>>>>> I would like to get the opinion of the community about the
>>>>>>>>> feasibility and applications of this proposal.
>>>>>>>>> Once we get some consensus we can discuss the design in detail.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Chinmay.
>>>>>>>>>
>>>>>>>>>
>
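
For reference, the SchemaAware proposal quoted above (registerSchema called
from upstream to downstream once the LogicalPlan exists) could look roughly
like the sketch below. This is an illustration only, not engine code: String
port names stand in for InputPort/OutputPort, Schema is reduced to a map of
field name to type name, and Stream, SchemaPropagator and the topological
order passed in by the caller are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Stand-in for the proposed Schema type: field name -> type name. */
final class Schema {
    final Map<String, String> fields;

    Schema(Map<String, String> fields) {
        this.fields = Collections.unmodifiableMap(new LinkedHashMap<>(fields));
    }
}

/** Shape of the proposed interface; String port names replace InputPort/OutputPort. */
interface SchemaAware {
    Map<String, Schema> registerSchema(Map<String, Schema> inputSchema);
}

/** Hypothetical stream: one upstream output port feeding one downstream input port. */
final class Stream {
    final String fromOp, fromPort, toOp, toPort;

    Stream(String fromOp, String fromPort, String toOp, String toPort) {
        this.fromOp = fromOp;
        this.fromPort = fromPort;
        this.toOp = toOp;
        this.toPort = toPort;
    }
}

final class SchemaPropagator {
    /**
     * Visits operators upstream first (the caller supplies a topological order),
     * calls registerSchema on each and forwards the returned output schemas along
     * the streams. Returns all output schemas keyed by "operator.port".
     */
    static Map<String, Schema> propagate(List<String> topoOrder,
                                         Map<String, SchemaAware> operators,
                                         List<Stream> streams) {
        Map<String, Map<String, Schema>> inputsByOp = new HashMap<>();
        Map<String, Schema> outputs = new LinkedHashMap<>();
        for (String op : topoOrder) {
            Map<String, Schema> in = inputsByOp.getOrDefault(op, Collections.emptyMap());
            Map<String, Schema> out = operators.get(op).registerSchema(in);
            for (Map.Entry<String, Schema> e : out.entrySet()) {
                outputs.put(op + "." + e.getKey(), e.getValue());
                for (Stream s : streams) {
                    if (s.fromOp.equals(op) && s.fromPort.equals(e.getKey())) {
                        inputsByOp.computeIfAbsent(s.toOp, k -> new HashMap<>())
                                  .put(s.toPort, e.getValue());
                    }
                }
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        Map<String, SchemaAware> ops = new HashMap<>();
        // Parser: has no input schema and declares its output fields itself.
        ops.put("parser", in -> {
            Map<String, String> f = new LinkedHashMap<>();
            f.put("name", "String");
            f.put("age", "Integer");
            return Collections.singletonMap("out", new Schema(f));
        });
        // Transform: derives its output schema from the input schema plus one new field.
        ops.put("transform", in -> {
            Map<String, String> f = new LinkedHashMap<>(in.get("in").fields);
            f.put("ageGroup", "String");
            return Collections.singletonMap("out", new Schema(f));
        });
        List<Stream> streams = Arrays.asList(new Stream("parser", "out", "transform", "in"));
        Map<String, Schema> result = propagate(Arrays.asList("parser", "transform"), ops, streams);
        result.forEach((port, schema) -> System.out.println(port + " -> " + schema.fields));
    }
}
```

Because the walk happens once on the logical plan, this covers only schemas
that are known before launch. The dynamic case discussed in this thread would
still need the schema to travel with the stream.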
