Re: Schema Discovery Support in Apex Applications

Pramod Immaneni Fri, 03 Feb 2017 15:23:58 -0800

Will do that.

On Fri, Feb 3, 2017 at 3:15 PM, Vlad Rozov <v.ro...@datatorrent.com> wrote:


> IMO, it will be good to summarize schema use case and proposed approach to
> support it on the control tuple e-mail thread. Not everyone interested in
> the custom control tuple may be following schema support thread.
>
> Thank you,
>
> Vlad
>
>
> On 2/3/17 08:47, Pramod Immaneni wrote:
>
>> On Fri, Feb 3, 2017 at 7:59 AM, Thomas Weise <t...@apache.org> wrote:
>>
>> Agreed. As noted the main concern was the ability to support idempotency.
>>> It isn't really "re-ordering" because when you have multiple input ports,
>>> there isn't any ordering guarantee within a streaming window.
>>>
>>> The reordering I was referring to is the reordering that would happen
>> either within the container that receives the tuples or the one that sends
>> the tuples, by holding on to the control tuple(s) till the window boundary
>> and not the order in which data is received across the different paths.
>> Also, the idempotency concern is that the operator developer might make a
>> mistake by making the incorrect ordering assumption, that you mentioned,
>> and do the wrong thing isn't it? It is not that this approach will break
>> idempotency or make it not possible to achieve it.
>>
>> It looks like we have an agreement on this approach so far. Do folks see
>> the need to summarize this in the control tuple discussion thread as well.
>> I can do that.
>>
>> Thanks
>>
>>
>> The end window boundary is good when the control tuple needs to be
>>> processed after all associated data tuples (which is the case for
>>> watermarks).
>>>
>>> For schema it is the opposite, the schema needs to be seen before all
>>> data
>>> tuples. The scenario of multiple input ports needs to be considered here
>>> as
>>> well.
>>>
>>> Thomas
>>>
>>>
>>> On Thu, Feb 2, 2017 at 9:59 AM, Vlad Rozov <v.ro...@datatorrent.com>
>>> wrote:
>>>
>>> I second the proposal to revisit custom control tuple delivery and
>>>> re-ordering. Schema support brings a use case that was missing when we
>>>> discussed custom control tuples.
>>>>
>>>> Thank you,
>>>>
>>>> Vlad
>>>>
>>>>
>>>> On 2/1/17 21:56, Pramod Immaneni wrote:
>>>>
>>>> This can be done neatly and possibly completely outside the engine if we
>>>>> are able to deliver schema information via the control tuple mechanism.
>>>>> Current control tuple proposal reorders the control tuple to be
>>>>>
>>>> delivered
>>>
>>>> at the end of the window to the operator. This would not be feasible for
>>>>> schemas as the schema would need to be delivered before the data. If we
>>>>> can
>>>>> reconsider this behavior and consider not reordering the control tuple
>>>>>
>>>> it
>>>
>>>> would work in this use case. We can have further discussions on the
>>>>> scenarios this raises like what to do when there are multiple paths for
>>>>> data, how control tuples get delivered to unifiers and look into
>>>>> suggestions like synchronizing on control tuple boundaries and other
>>>>>
>>>> ways
>>>
>>>> to solve these. What do you guys think?
>>>>>
>>>>> On Wed, Feb 1, 2017 at 8:27 PM, Thomas Weise <t...@apache.org> wrote:
>>>>>
>>>>> I think dynamic schema would be good to consider (schema known and
>>>>>
>>>>>> possibly
>>>>>> changing at runtime). Some applications cannot be written under the
>>>>>> assumption that the schema is known upfront.
>>>>>>
>>>>>> Also, does this really need to leak into the engine? I think it would
>>>>>>
>>>>> be
>>>
>>>> good to consider alternatives and tradeoffs.
>>>>>>
>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>> On Mon, Jan 30, 2017 at 10:44 PM, Chinmay Kolhatkar <
>>>>>> chin...@datatorrent.com
>>>>>>
>>>>>> wrote:
>>>>>>> Consumer of output port operator schema is going next downstream
>>>>>>>
>>>>>>> operator.
>>>>>>
>>>>>> On Tue, Jan 31, 2017 at 4:01 AM, Sergey Golovko <
>>>>>>>
>>>>>> ser...@datatorrent.com
>>>
>>>> wrote:
>>>>>>>
>>>>>>> Sorry, I’m a new person in the APEX team. And I don't understand
>>>>>>> clearly
>>>>>>> who are consumers of the output port operator schema(s).
>>>>>>>
>>>>>>>> 1. If the consumers are non-run-time callers like the application
>>>>>>>>
>>>>>>>> manager
>>>>>>> or UI designer, maybe it makes sense to use Java static method(s) to
>>>>>>>
>>>>>>>> retrieve the output port operator schema(s). I guess the performance
>>>>>>>>
>>>>>>>> of a
>>>>>>> single call of a static method via reflection can be ignored.
>>>>>>>
>>>>>>>> 2. If the consumer is next downstream operator, maybe it makes sense
>>>>>>>>
>>>>>>> to
>>>
>>>> send an output port operator schema from upstream operator to next
>>>>>>>> downstream operator via the stream. The corresponded methods that
>>>>>>>>
>>>>>>> would
>>>
>>>> send and receive the schema should be declared in the
>>>>>>>> interface/abstract-class of the upstream and downstream operators.
>>>>>>>>
>>>>>>> The
>>>
>>>> sending/receiving of an output schema should be processed right
>>>>>>>>
>>>>>>> before
>>>
>>>> the
>>>>>>>
>>>>>>> sending of the first data record via the stream.
>>>>>>>>
>>>>>>>> One of examples of a typical implementation for sending of metadata
>>>>>>>>
>>>>>>>> with
>>>>>>> a
>>>>>>>
>>>>>>> regular result set is the sending of JDBC metadata as a part of JDBC
>>>>>>>>
>>>>>>>> result
>>>>>>>
>>>>>>> set. And I hope the output schema (metadata of the streamed data) in
>>>>>>>>
>>>>>>>> the
>>>>>>> implementation should contain not only a signature of the streamed
>>>>>>> objects
>>>>>>>
>>>>>>> (like field names and data types), but also any other properties of
>>>>>>>>
>>>>>>> the
>>>
>>>> data that can be useful by the schema receiver to process the data
>>>>>>>>
>>>>>>> (for
>>>
>>>> instance, a delimiter for CSV record stream).
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Sergey
>>>>>>>>
>>>>>>>> On 2017-01-25 01:47 (-0800), Chinmay Kolhatkar <
>>>>>>>>
>>>>>>>> chin...@datatorrent.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thank you all for the feedback.
>>>>>>>>>
>>>>>>>>> I've created a Jira for this: APEXCORE-623 and I'll attach the same
>>>>>>>>> document and link to this mailchain there.
>>>>>>>>>
>>>>>>>>> As a first part of this Jira, there are 2 steps I would like to
>>>>>>>>>
>>>>>>>>> propose:
>>>>>>>> 1. Add following interface at com.datatorrent.common.util.
>>>>>>>> SchemaAware.
>>>>>>>>
>>>>>>> interface SchemaAware {
>>>>>>>
>>>>>>>> Map<OutputPort, Schema> registerSchema(Map<InputPort, Schema>
>>>>>>>>>
>>>>>>>>> inputSchema);
>>>>>>>>
>>>>>>>> }
>>>>>>>>>
>>>>>>>>> This interface can be implemented by Operators to communicate its
>>>>>>>>>
>>>>>>>>> output
>>>>>>>> schema(s) to engine.
>>>>>>>>
>>>>>>>>> Input to this schema will be schema at its input port.
>>>>>>>>>
>>>>>>>>> 2. After LogicalPlan is created call SchemaAware method from
>>>>>>>>>
>>>>>>>> upstream
>>>
>>>> to
>>>>>>>> downstream operator in the DAG to propagate the Schema.
>>>>>>>>
>>>>>>>>> Once this is done, changes can be done in Malhar for the operators
>>>>>>>>>
>>>>>>>> in
>>>
>>>> question.
>>>>>>>>>
>>>>>>>>> Please share your opinion on this approach.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Chinmay.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jan 18, 2017 at 2:31 PM, Priyanka Gugale <
>>>>>>>>> pri...@apache.org
>>>>>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> +1 to have this feature.
>>>>>>>>>
>>>>>>>>>> -Priyanka
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 17, 2017 at 9:18 PM, Pramod Immaneni <
>>>>>>>>>>
>>>>>>>>>> pra...@datatorrent.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> +1
>>>>>>>>>>
>>>>>>>>>>> On Mon, Jan 16, 2017 at 1:23 AM, Chinmay Kolhatkar <
>>>>>>>>>>>
>>>>>>>>>>> chin...@apache.org>
>>>>>>>>>>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>>> Currently a DAG that is generated by user, if contains any
>>>>>>>>>>>>
>>>>>>>>>>>> POJOfied
>>>>>>>>>>>
>>>>>>>>>> operators, TUPLE_CLASS attribute needs to be set on each and
>>>>>>>>
>>>>>>>>> every
>>>>>>>>>>>
>>>>>>>>>> port
>>>>>>>>
>>>>>>>> which receives or sends a POJO.
>>>>>>>>>
>>>>>>>>>> For e.g., if a DAG is like File -> Parser -> Transform -> Dedup
>>>>>>>>>>>>
>>>>>>>>>>>> ->
>>>>>>>>>>>
>>>>>>>>>> Formatter -> Kafka, then TUPLE_CLASS attribute needs to be set
>>>>>>>>
>>>>>>>>> by
>>>>>>>>>>>
>>>>>>>>>> user
>>>>>>>
>>>>>>>> on
>>>>>>>>>
>>>>>>>>>> both input and output ports of transform, dedup operators and
>>>>>>>>>>> also
>>>>>>>>>>>
>>>>>>>>>> on
>>>>>>>>
>>>>>>>> parser output and formatter input.
>>>>>>>>>
>>>>>>>>>> The proposal here is to reduce work that is required by user to
>>>>>>>>>>>>
>>>>>>>>>>>> configure
>>>>>>>>>>> the DAG. Technically speaking if an operators knows input
>>>>>>>>>>> schema
>>>>>>>>>>>
>>>>>>>>>> and
>>>>>>>
>>>>>>>> processing properties, it can determine output schema and
>>>>>>>>>
>>>>>>>>>> convey
>>>>>>>>>>>
>>>>>>>>>> it to
>>>>>>>
>>>>>>>> downstream operators. This way the complete pipeline can be
>>>>>>>>>
>>>>>>>>>> configured
>>>>>>>>>>>
>>>>>>>>>> without user setting TUPLE_CLASS or even creating POJOs and
>>>>>>>>>
>>>>>>>>>> adding
>>>>>>>>>>>
>>>>>>>>>> them
>>>>>>>>
>>>>>>>> to
>>>>>>>>>
>>>>>>>>>> classpath.
>>>>>>>>>>>>
>>>>>>>>>>>> On the same idea, I want to propose an approach where the
>>>>>>>>>>>>
>>>>>>>>>>>> pipeline
>>>>>>>>>>>
>>>>>>>>>> can
>>>>>>>>
>>>>>>>> be
>>>>>>>>>
>>>>>>>>>> configured without user setting TUPLE_CLASS or even creating
>>>>>>>>>>> POJOs
>>>>>>>>>>>
>>>>>>>>>> and
>>>>>>>>
>>>>>>>> adding them to classpath.
>>>>>>>>>
>>>>>>>>>> Here is the document which at a high level explains the idea
>>>>>>>>>>>>
>>>>>>>>>>>> and
>>>>>>>>>>>
>>>>>>>>>> a
>>>>>>>
>>>>>>> high
>>>>>>>>
>>>>>>>> level design:
>>>>>>>>>
>>>>>>>>>> https://docs.google.com/document/d/1ibLQ1KYCLTeufG7dLoHyN_
>>>>>>>>>>>> tRQXEM3LR-7o_S0z_porQ/edit?usp=sharing
>>>>>>>>>>>>
>>>>>>>>>>>> I would like to get opinion from community about feasibility
>>>>>>>>>>>>
>>>>>>>>>>>> and
>>>>>>>>>>>
>>>>>>>>>> applications of this proposal.
>>>>>>>
>>>>>>>> Once we get some consensus we can discuss the design in
>>>>>>>>>>>>
>>>>>>>>>>>> details.
>>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>
>>>>>>>> Chinmay.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>

Re: Schema Discovery Support in Apex Applications

Reply via email to