Will do that. On Fri, Feb 3, 2017 at 3:15 PM, Vlad Rozov <v.ro...@datatorrent.com> wrote:
> IMO, it will be good to summarize schema use case and proposed approach to > support it on the control tuple e-mail thread. Not everyone interested in > the custom control tuple may be following schema support thread. > > Thank you, > > Vlad > > > On 2/3/17 08:47, Pramod Immaneni wrote: > >> On Fri, Feb 3, 2017 at 7:59 AM, Thomas Weise <t...@apache.org> wrote: >> >> Agreed. As noted the main concern was the ability to support idempotency. >>> It isn't really "re-ordering" because when you have multiple input ports, >>> there isn't any ordering guarantee within a streaming window. >>> >>> The reordering I was referring to is the reordering that would happen >> either within the container that receives the tuples or the one that sends >> the tuples, by holding on to the control tuple(s) till the window boundary >> and not the order in which data is received across the different paths. >> Also, the idempotency concern is that the operator developer might make a >> mistake by making the incorrect ordering assumption, that you mentioned, >> and do the wrong thing isn't it? It is not that this approach will break >> idempotency or make it not possible to achieve it. >> >> It looks like we have an agreement on this approach so far. Do folks see >> the need to summarize this in the control tuple discussion thread as well. >> I can do that. >> >> Thanks >> >> >> The end window boundary is good when the control tuple needs to be >>> processed after all associated data tuples (which is the case for >>> watermarks). >>> >>> For schema it is the opposite, the schema needs to be seen before all >>> data >>> tuples. The scenario of multiple input ports needs to be considered here >>> as >>> well. >>> >>> Thomas >>> >>> >>> On Thu, Feb 2, 2017 at 9:59 AM, Vlad Rozov <v.ro...@datatorrent.com> >>> wrote: >>> >>> I second the proposal to revisit custom control tuple delivery and >>>> re-ordering. Schema support brings a use case that was missing when we >>>> discussed custom control tuples. >>>> >>>> Thank you, >>>> >>>> Vlad >>>> >>>> >>>> On 2/1/17 21:56, Pramod Immaneni wrote: >>>> >>>> This can be done neatly and possibly completely outside the engine if we >>>>> are able to deliver schema information via the control tuple mechanism. >>>>> Current control tuple proposal reorders the control tuple to be >>>>> >>>> delivered >>> >>>> at the end of the window to the operator. This would not be feasible for >>>>> schemas as the schema would need to be delivered before the data. If we >>>>> can >>>>> reconsider this behavior and consider not reordering the control tuple >>>>> >>>> it >>> >>>> would work in this use case. We can have further discussions on the >>>>> scenarios this raises like what to do when there are multiple paths for >>>>> data, how control tuples get delivered to unifiers and look into >>>>> suggestions like synchronizing on control tuple boundaries and other >>>>> >>>> ways >>> >>>> to solve these. What do you guys think? >>>>> >>>>> On Wed, Feb 1, 2017 at 8:27 PM, Thomas Weise <t...@apache.org> wrote: >>>>> >>>>> I think dynamic schema would be good to consider (schema known and >>>>> >>>>>> possibly >>>>>> changing at runtime). Some applications cannot be written under the >>>>>> assumption that the schema is known upfront. >>>>>> >>>>>> Also, does this really need to leak into the engine? I think it would >>>>>> >>>>> be >>> >>>> good to consider alternatives and tradeoffs. >>>>>> >>>>>> Thomas >>>>>> >>>>>> >>>>>> On Mon, Jan 30, 2017 at 10:44 PM, Chinmay Kolhatkar < >>>>>> chin...@datatorrent.com >>>>>> >>>>>> wrote: >>>>>>> Consumer of output port operator schema is going next downstream >>>>>>> >>>>>>> operator. >>>>>> >>>>>> On Tue, Jan 31, 2017 at 4:01 AM, Sergey Golovko < >>>>>>> >>>>>> ser...@datatorrent.com >>> >>>> wrote: >>>>>>> >>>>>>> Sorry, I’m a new person in the APEX team. And I don't understand >>>>>>> clearly >>>>>>> who are consumers of the output port operator schema(s). >>>>>>> >>>>>>>> 1. If the consumers are non-run-time callers like the application >>>>>>>> >>>>>>>> manager >>>>>>> or UI designer, maybe it makes sense to use Java static method(s) to >>>>>>> >>>>>>>> retrieve the output port operator schema(s). I guess the performance >>>>>>>> >>>>>>>> of a >>>>>>> single call of a static method via reflection can be ignored. >>>>>>> >>>>>>>> 2. If the consumer is next downstream operator, maybe it makes sense >>>>>>>> >>>>>>> to >>> >>>> send an output port operator schema from upstream operator to next >>>>>>>> downstream operator via the stream. The corresponded methods that >>>>>>>> >>>>>>> would >>> >>>> send and receive the schema should be declared in the >>>>>>>> interface/abstract-class of the upstream and downstream operators. >>>>>>>> >>>>>>> The >>> >>>> sending/receiving of an output schema should be processed right >>>>>>>> >>>>>>> before >>> >>>> the >>>>>>> >>>>>>> sending of the first data record via the stream. >>>>>>>> >>>>>>>> One of examples of a typical implementation for sending of metadata >>>>>>>> >>>>>>>> with >>>>>>> a >>>>>>> >>>>>>> regular result set is the sending of JDBC metadata as a part of JDBC >>>>>>>> >>>>>>>> result >>>>>>> >>>>>>> set. And I hope the output schema (metadata of the streamed data) in >>>>>>>> >>>>>>>> the >>>>>>> implementation should contain not only a signature of the streamed >>>>>>> objects >>>>>>> >>>>>>> (like field names and data types), but also any other properties of >>>>>>>> >>>>>>> the >>> >>>> data that can be useful by the schema receiver to process the data >>>>>>>> >>>>>>> (for >>> >>>> instance, a delimiter for CSV record stream). >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Sergey >>>>>>>> >>>>>>>> On 2017-01-25 01:47 (-0800), Chinmay Kolhatkar < >>>>>>>> >>>>>>>> chin...@datatorrent.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Thank you all for the feedback. >>>>>>>>> >>>>>>>>> I've created a Jira for this: APEXCORE-623 and I'll attach the same >>>>>>>>> document and link to this mailchain there. >>>>>>>>> >>>>>>>>> As a first part of this Jira, there are 2 steps I would like to >>>>>>>>> >>>>>>>>> propose: >>>>>>>> 1. Add following interface at com.datatorrent.common.util. >>>>>>>> SchemaAware. >>>>>>>> >>>>>>> interface SchemaAware { >>>>>>> >>>>>>>> Map<OutputPort, Schema> registerSchema(Map<InputPort, Schema> >>>>>>>>> >>>>>>>>> inputSchema); >>>>>>>> >>>>>>>> } >>>>>>>>> >>>>>>>>> This interface can be implemented by Operators to communicate its >>>>>>>>> >>>>>>>>> output >>>>>>>> schema(s) to engine. >>>>>>>> >>>>>>>>> Input to this schema will be schema at its input port. >>>>>>>>> >>>>>>>>> 2. After LogicalPlan is created call SchemaAware method from >>>>>>>>> >>>>>>>> upstream >>> >>>> to >>>>>>>> downstream operator in the DAG to propagate the Schema. >>>>>>>> >>>>>>>>> Once this is done, changes can be done in Malhar for the operators >>>>>>>>> >>>>>>>> in >>> >>>> question. >>>>>>>>> >>>>>>>>> Please share your opinion on this approach. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Chinmay. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Jan 18, 2017 at 2:31 PM, Priyanka Gugale < >>>>>>>>> pri...@apache.org >>>>>>>>> >>>>>>>> wrote: >>>>>>>> >>>>>>>> +1 to have this feature. >>>>>>>>> >>>>>>>>>> -Priyanka >>>>>>>>>> >>>>>>>>>> On Tue, Jan 17, 2017 at 9:18 PM, Pramod Immaneni < >>>>>>>>>> >>>>>>>>>> pra...@datatorrent.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> +1 >>>>>>>>>> >>>>>>>>>>> On Mon, Jan 16, 2017 at 1:23 AM, Chinmay Kolhatkar < >>>>>>>>>>> >>>>>>>>>>> chin...@apache.org> >>>>>>>>>> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hi All, >>>>>>>>>>> >>>>>>>>>>>> Currently a DAG that is generated by user, if contains any >>>>>>>>>>>> >>>>>>>>>>>> POJOfied >>>>>>>>>>> >>>>>>>>>> operators, TUPLE_CLASS attribute needs to be set on each and >>>>>>>> >>>>>>>>> every >>>>>>>>>>> >>>>>>>>>> port >>>>>>>> >>>>>>>> which receives or sends a POJO. >>>>>>>>> >>>>>>>>>> For e.g., if a DAG is like File -> Parser -> Transform -> Dedup >>>>>>>>>>>> >>>>>>>>>>>> -> >>>>>>>>>>> >>>>>>>>>> Formatter -> Kafka, then TUPLE_CLASS attribute needs to be set >>>>>>>> >>>>>>>>> by >>>>>>>>>>> >>>>>>>>>> user >>>>>>> >>>>>>>> on >>>>>>>>> >>>>>>>>>> both input and output ports of transform, dedup operators and >>>>>>>>>>> also >>>>>>>>>>> >>>>>>>>>> on >>>>>>>> >>>>>>>> parser output and formatter input. >>>>>>>>> >>>>>>>>>> The proposal here is to reduce work that is required by user to >>>>>>>>>>>> >>>>>>>>>>>> configure >>>>>>>>>>> the DAG. Technically speaking if an operators knows input >>>>>>>>>>> schema >>>>>>>>>>> >>>>>>>>>> and >>>>>>> >>>>>>>> processing properties, it can determine output schema and >>>>>>>>> >>>>>>>>>> convey >>>>>>>>>>> >>>>>>>>>> it to >>>>>>> >>>>>>>> downstream operators. This way the complete pipeline can be >>>>>>>>> >>>>>>>>>> configured >>>>>>>>>>> >>>>>>>>>> without user setting TUPLE_CLASS or even creating POJOs and >>>>>>>>> >>>>>>>>>> adding >>>>>>>>>>> >>>>>>>>>> them >>>>>>>> >>>>>>>> to >>>>>>>>> >>>>>>>>>> classpath. >>>>>>>>>>>> >>>>>>>>>>>> On the same idea, I want to propose an approach where the >>>>>>>>>>>> >>>>>>>>>>>> pipeline >>>>>>>>>>> >>>>>>>>>> can >>>>>>>> >>>>>>>> be >>>>>>>>> >>>>>>>>>> configured without user setting TUPLE_CLASS or even creating >>>>>>>>>>> POJOs >>>>>>>>>>> >>>>>>>>>> and >>>>>>>> >>>>>>>> adding them to classpath. >>>>>>>>> >>>>>>>>>> Here is the document which at a high level explains the idea >>>>>>>>>>>> >>>>>>>>>>>> and >>>>>>>>>>> >>>>>>>>>> a >>>>>>> >>>>>>> high >>>>>>>> >>>>>>>> level design: >>>>>>>>> >>>>>>>>>> https://docs.google.com/document/d/1ibLQ1KYCLTeufG7dLoHyN_ >>>>>>>>>>>> tRQXEM3LR-7o_S0z_porQ/edit?usp=sharing >>>>>>>>>>>> >>>>>>>>>>>> I would like to get opinion from community about feasibility >>>>>>>>>>>> >>>>>>>>>>>> and >>>>>>>>>>> >>>>>>>>>> applications of this proposal. >>>>>>> >>>>>>>> Once we get some consensus we can discuss the design in >>>>>>>>>>>> >>>>>>>>>>>> details. >>>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>> >>>>>>>> Chinmay. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >