IMO, it will be good to summarize schema use case and proposed approach to support it on the control tuple e-mail thread. Not everyone interested in the custom control tuple may be following schema support thread.

Thank you,

Vlad

On 2/3/17 08:47, Pramod Immaneni wrote:
On Fri, Feb 3, 2017 at 7:59 AM, Thomas Weise <t...@apache.org> wrote:

Agreed. As noted the main concern was the ability to support idempotency.
It isn't really "re-ordering" because when you have multiple input ports,
there isn't any ordering guarantee within a streaming window.

The reordering I was referring to is the reordering that would happen
either within the container that receives the tuples or the one that sends
the tuples, by holding on to the control tuple(s) till the window boundary
and not the order in which data is received across the different paths.
Also, the idempotency concern is that the operator developer might make a
mistake by making the incorrect ordering assumption, that you mentioned,
and do the wrong thing isn't it? It is not that this approach will break
idempotency or make it not possible to achieve it.

It looks like we have an agreement on this approach so far. Do folks see
the need to summarize this in the control tuple discussion thread as well.
I can do that.

Thanks


The end window boundary is good when the control tuple needs to be
processed after all associated data tuples (which is the case for
watermarks).

For schema it is the opposite, the schema needs to be seen before all data
tuples. The scenario of multiple input ports needs to be considered here as
well.

Thomas


On Thu, Feb 2, 2017 at 9:59 AM, Vlad Rozov <v.ro...@datatorrent.com>
wrote:

I second the proposal to revisit custom control tuple delivery and
re-ordering. Schema support brings a use case that was missing when we
discussed custom control tuples.

Thank you,

Vlad


On 2/1/17 21:56, Pramod Immaneni wrote:

This can be done neatly and possibly completely outside the engine if we
are able to deliver schema information via the control tuple mechanism.
Current control tuple proposal reorders the control tuple to be
delivered
at the end of the window to the operator. This would not be feasible for
schemas as the schema would need to be delivered before the data. If we
can
reconsider this behavior and consider not reordering the control tuple
it
would work in this use case. We can have further discussions on the
scenarios this raises like what to do when there are multiple paths for
data, how control tuples get delivered to unifiers and look into
suggestions like synchronizing on control tuple boundaries and other
ways
to solve these. What do you guys think?

On Wed, Feb 1, 2017 at 8:27 PM, Thomas Weise <t...@apache.org> wrote:

I think dynamic schema would be good to consider (schema known and
possibly
changing at runtime). Some applications cannot be written under the
assumption that the schema is known upfront.

Also, does this really need to leak into the engine? I think it would
be
good to consider alternatives and tradeoffs.

Thomas


On Mon, Jan 30, 2017 at 10:44 PM, Chinmay Kolhatkar <
chin...@datatorrent.com

wrote:
Consumer of output port operator schema is going next downstream

operator.

On Tue, Jan 31, 2017 at 4:01 AM, Sergey Golovko <
ser...@datatorrent.com
wrote:

Sorry, I’m a new person in the APEX team. And I don't understand
clearly
who are consumers of the output port operator schema(s).
1. If the consumers are non-run-time callers like the application

manager
or UI designer, maybe it makes sense to use Java static method(s) to
retrieve the output port operator schema(s). I guess the performance

of a
single call of a static method via reflection can be ignored.
2. If the consumer is next downstream operator, maybe it makes sense
to
send an output port operator schema from upstream operator to next
downstream operator via the stream. The corresponded methods that
would
send and receive the schema should be declared in the
interface/abstract-class of the upstream and downstream operators.
The
sending/receiving of an output schema should be processed right
before
the

sending of the first data record via the stream.

One of examples of a typical implementation for sending of metadata

with
a

regular result set is the sending of JDBC metadata as a part of JDBC

result

set. And I hope the output schema (metadata of the streamed data) in

the
implementation should contain not only a signature of the streamed
objects

(like field names and data types), but also any other properties of
the
data that can be useful by the schema receiver to process the data
(for
instance, a delimiter for CSV record stream).

Thanks,
Sergey

On 2017-01-25 01:47 (-0800), Chinmay Kolhatkar <

chin...@datatorrent.com>
wrote:
Thank you all for the feedback.

I've created a Jira for this: APEXCORE-623 and I'll attach the same
document and link to this mailchain there.

As a first part of this Jira, there are 2 steps I would like to

propose:
1. Add following interface at com.datatorrent.common.util.
SchemaAware.
interface SchemaAware {
Map<OutputPort, Schema> registerSchema(Map<InputPort, Schema>

inputSchema);

}

This interface can be implemented by Operators to communicate its

output
schema(s) to engine.
Input to this schema will be schema at its input port.

2. After LogicalPlan is created call SchemaAware method from
upstream
to
downstream operator in the DAG to propagate the Schema.
Once this is done, changes can be done in Malhar for the operators
in
question.

Please share your opinion on this approach.

Thanks,
Chinmay.




On Wed, Jan 18, 2017 at 2:31 PM, Priyanka Gugale <pri...@apache.org
wrote:

+1 to have this feature.
-Priyanka

On Tue, Jan 17, 2017 at 9:18 PM, Pramod Immaneni <

pra...@datatorrent.com>
wrote:
+1
On Mon, Jan 16, 2017 at 1:23 AM, Chinmay Kolhatkar <

chin...@apache.org>
wrote:
Hi All,
Currently a DAG that is generated by user, if contains any

POJOfied
operators, TUPLE_CLASS attribute needs to be set on each and
every
port

which receives or sends a POJO.
For e.g., if a DAG is like File -> Parser -> Transform -> Dedup

->
Formatter -> Kafka, then TUPLE_CLASS attribute needs to be set
by
user
on
both input and output ports of transform, dedup operators and
also
on

parser output and formatter input.
The proposal here is to reduce work that is required by user to

configure
the DAG. Technically speaking if an operators knows input
schema
and
processing properties, it can determine output schema and
convey
it to
downstream operators. This way the complete pipeline can be
configured
without user setting TUPLE_CLASS or even creating POJOs and
adding
them

to
classpath.

On the same idea, I want to propose an approach where the

pipeline
can

be
configured without user setting TUPLE_CLASS or even creating
POJOs
and

adding them to classpath.
Here is the document which at a high level explains the idea

and
a

high

level design:
https://docs.google.com/document/d/1ibLQ1KYCLTeufG7dLoHyN_
tRQXEM3LR-7o_S0z_porQ/edit?usp=sharing

I would like to get opinion from community about feasibility

and
applications of this proposal.
Once we get some consensus we can discuss the design in

details.
Thanks,
Chinmay.



Reply via email to