Hi Sagar,

thanks for the info ... this way I learn more and more :-)

Looking forward to answering the questions as they come.

Chris

On 29.08.18, 19:28, "Sagar" <[email protected]> wrote:

    Hi Chris,
    
    Thanks. Typically the Kafka cluster will be a separate set of nodes, and so
    will the Kafka Connect workers, which connect to the PLC devices, databases,
    or whatever the source is, and push the data to Kafka.
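To make that concrete, a worker-side configuration for such a source connector might look roughly like this. Only `name`, `connector.class` and `tasks.max` are standard Kafka Connect keys, and the connector class matches the feature branch; the remaining property names are invented here for illustration, not a finished spec:

```properties
# Hypothetical config for a PLC4X source connector running on a
# Kafka Connect worker (url/query/rate keys are illustrative only).
name=plc4x-source-example
connector.class=org.apache.plc4x.kafka.Plc4xSourceConnector
tasks.max=1
topic=machine-data
url=s7://192.168.0.1/0/0
query=<field-address>
rate=2000
```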
    
    I will start off with this information and extend your feature branch.
    I'll keep asking questions along the way.
    
    Sagar.
    
    On Wed, Aug 29, 2018 at 7:51 PM Christofer Dutz <[email protected]>
    wrote:
    
    > Hi Sagar,
    >
    > Great that we seem to be on the same page now ;-)
    >
    > Regarding the "Kafka Connecting" ... what I meant is that the
    > Kafka-Connect-PLC4X instance connects ... I was assuming the driver to be
    > running on a Kafka node, but that's just due to my limited knowledge of
    > everything ;-)
    >
    > Well the code for actively reading stuff from a PLC should already be in
    > my example implementation. It should work out of the box this way ... As I
    > have seen several mock drivers implemented in PLC4X, I am currently
    > thinking of implementing one that you should be able to just import and
    > use ... however I'm currently working hard on refactoring the API
    > completely, so I would postpone that to after these changes are in there.
    > But rest assured ... I would handle the refactoring, so you can just
    > assume that it works.
    >
    > Alternatively, I could have an account for our IoT VPN created for you.
    > Then you could log in to our VPN and talk to some real PLCs ...
    >
    > I think I wanted to create an account for Julian, but the person
    > responsible for creating them was on holiday ... I will re-check this.
    >
    > Chris
    >
    >
    >
    > On 29.08.18, 16:11, "Sagar" <[email protected]> wrote:
    >
    >     Hi Chris,
    >
    >     That's perfectly fine :)
    >
    >     So, the way I understand this now is, we will have a bunch of worker
    >     nodes (in Kafka Connect terminology, a worker is a JVM process which
    >     runs a set of connectors/tasks to poll a source and push data to
    >     Kafka).
    >
    >     So, vis-a-vis a JDBC connection, we will have a connection URL which
    >     will let us connect to these PLC devices, poll (poll in the sense you
    >     meant it above), and then push data to Kafka. If this looks fine, can
    >     you give me some documentation to refer to, and also how can I start
    >     testing these?
    >
    >     And just one thing I wanted to clarify when you say Kafka nodes
    >     connecting to devices: that's something which doesn't happen. Kafka
    >     doesn't connect to any device. I think you just mean it in a more
    >     abstract way, right?
    >
    >     @Julian,
    >
    >     Thanks, I was going through the link you sent. So, you're saying via
    >     scraping, we can push events to Kafka? Is that already happening, and
    >     we should look to move this functionality out?
    >
    >     Thanks!
    >     Sagar.
    >
    >
    >     On Wed, Aug 29, 2018 at 11:05 AM Julian Feinauer <
    >     [email protected]> wrote:
    >
    >     > Hey Sagar,
    >     >
    >     > hey Chris,
    >     >
    >     >
    >     >
    >     > I want to join your discussion for part b3 as this is something we
    > usually
    >     > require.
    >     >
    >     > We use a module we call the plc-scraper for that task (the term
    >     > scraping in this context is borrowed from Prometheus, where it is
    >     > used extensively [1]).
    >     >
    >     > Generally speaking, the scraper takes a config containing addresses
    >     > and scrape rates, runs as a daemon, and pushes the scrape results
    >     > downstream (usually Kafka, but we also use other "queues").
    >     >
    >     >
    >     >
    >     > As we are currently rewriting this scraper, I already considered
    >     > donating it to plc4x in the form of an example or perhaps even as a
    >     > standalone module.
    >     >
    >     >
    >     >
    >     > So I agree with Chris that this should not be part of the PlcDriver
    >     > level but rather on another layer "on top", and I would be more
    >     > interested in the specification of a "line protocol" which describes
    >     > how messages are serialized for Kafka (or other sinks).
    >     >
    >     > Can we come up with a common "schema" which fits many use cases?
    >     >
    >     >
    >     >
    >     > Our messages contain the following information:
    >     >
    >     > - timestamp
    >     >
    >     > - source
    >     >
    >     > - values
    >     >
    >     > - additional tags
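As a sketch of what one such line-protocol message could look like on the wire, here is a JSON rendering of Julian's four fields (the concrete values and tag names are invented for illustration, not an agreed schema):

```json
{
  "timestamp": "2018-08-29T09:05:00Z",
  "source": "s7://192.168.0.1/0/0",
  "values": {
    "motor-current": 12.7,
    "temperature": 48.2
  },
  "tags": {
    "line": "assembly-1"
  }
}
```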
    >     >
    >     >
    >     >
    >     > Best
    >     >
    >     > Julian
    >     >
    >     >
    >     >
    >     > [1] https://prometheus.io/docs/prometheus/latest/getting_started/
    >     >
    >     >
    >     >
    >     > On 28.08.18, 22:53, "Christofer Dutz"
    >     > <[email protected]> wrote:
    >     >
    >     >
    >     >
    >     >     Hi Sagar,
    >     >
    >     >
    >     >
    >     >     Sorry for not responding ... your mail must have slipped past
    >     >     me ... sorry for that.
    >     >
    >     >
    >     >
    >     >     a) PLC4X works exactly the same way ... it consists of
    >     >     plc4x-core, which only contains the DriverManager, and
    >     >     plc4x-api, which contains the API classes. So with these two
    >     >     jars you can build a full PLC4X application.
    >     >
    >     >     In order to connect to a PLC you need to add the jar containing
    >     >     the required driver to the classpath.
    >     >
    >     >
    >     >
    >     >     b) Well, if using something other than the connection url as
    >     >     partition key, it is possible that multiple Kafka Connect nodes
    >     >     would connect to the same PLC. In that case it could be a
    >     >     problem to control the order. I guess using the timestamp when
    >     >     receiving a response (probably generated by the KC plc4x
    >     >     driver) could be a valid approach.
    >     >
    >     >
    >     >
    >     >     b2) Regarding the infinite loop ... I think we won't need such
    >     >     a mechanism. If we think of a set of fields from a PLC, we can
    >     >     think of a PLC as a one-row database table. Producing diffs
    >     >     should be a lot simpler that way.
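The "one-row table" diffing described here can be sketched in a few lines of plain Java. The class and method names below are made up for illustration and are not part of the PLC4X API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the one-row-table diff: compare the last polled field
// values with the current ones and keep only the changed fields.
public class FieldDiff {

    // Returns the subset of 'current' whose values differ from
    // 'previous' (including fields that are new in 'current').
    public static Map<String, Object> diff(Map<String, Object> previous,
                                           Map<String, Object> current) {
        Map<String, Object> changed = new HashMap<>();
        for (Map.Entry<String, Object> e : current.entrySet()) {
            Object old = previous.get(e.getKey());
            if (old == null || !old.equals(e.getValue())) {
                changed.put(e.getKey(), e.getValue());
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Map<String, Object> prev = new HashMap<>();
        prev.put("motor-current", 12);
        prev.put("temperature", 48);
        Map<String, Object> curr = new HashMap<>();
        curr.put("motor-current", 12);
        curr.put("temperature", 51);
        System.out.println(diff(prev, curr)); // only temperature changed
    }
}
```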
    >     >
    >     >
    >     >
    >     >     b3) Regarding push events ... PLC4X has a subscription mode
    >     >     next to the polling. So it would be possible to also define
    >     >     PLC4X datasources that actively push events to Kafka ... I have
    >     >     to admit that this would be the mode I would prefer most. But
    >     >     as not all protocols and PLCs support this mode, I think it
    >     >     would be safest to use polling and to add push support after
    >     >     that.
    >     >
    >     >
    >     >
    >     >     Regarding different languages: currently we are concentrating
    >     >     mainly on the Java implementation, as it's the biggest
    >     >     challenge to understand and implement the protocols. Porting
    >     >     them to other languages (especially C and C++) shouldn't be as
    >     >     hard as implementing the first version. But that's currently an
    >     >     area we haven't covered, as we don't have the resources to
    >     >     implement all of them at once.
    >     >
    >     >
    >     >
    >     >     And there are no stupid questions :-)
    >     >
    >     >
    >     >
    >     >     Hope I could answer all of yours. If not, just ask and I'll
    > probably
    >     > not miss that one ;-)
    >     >
    >     >
    >     >
    >     >     Chris
    >     >
    >     >
    >     >
    >     >
    >     >
    >     >     On 23.08.18, 19:52, "Sagar" <[email protected]> wrote:
    >     >
    >     >
    >     >
    >     >         Hi Christofer,
    >     >
    >     >
    >     >
    >     >         Thanks for the detailed responses. I would like to ask a
    >     >         couple more questions (which may be borderline naive or
    >     >         stupid :D).
    >     >
    >     >
    >     >
    >     >         First thing that I would like to know - ignore my lack of
    >     >         knowledge on PLCs - but from what I understand, these are
    >     >         small devices used to execute program instructions. These
    >     >         would have very small memory footprints as well, I believe?
    >     >         Also, when you say the Siemens one can handle 20
    >     >         connections, would that be from different devices
    >     >         connecting to it? The reason I ask these questions is this ->
    >     >
    >     >
    >     >
    >     >         a) The way the kafka-connect framework is executed is by
    >     >         installing the whole framework with all the relevant jars
    >     >         needed on the classpath. So, if you talk about the JDBC
    >     >         connector for K-Connect, it would need the mysql driver
    >     >         jar (for example) and other jars needed to support the
    >     >         framework. If we, say, choose to use Avro, then we would
    >     >         need more jars to support that. Would we be able to install
    >     >         all that?
    >     >
    >     >
    >     >
    >     >         b) Also, if multiple devices do connect to it, then won't
    >     >         we have events arriving out of order from them? Does the
    >     >         ordering matter amongst events that are being pushed?
    >     >
    >     >
    >     >
    >     >         Regarding the infinite loop question, the reason the JDBC
    >     >         connector uses that is that it creates tasks for a given
    >     >         table and fires queries to find deltas. So, if the polling
    >     >         frequency is 2 seconds and it last ran at 12.00.00, then it
    >     >         would run at 12.00.02 to figure out what changed in that
    >     >         time frame. So, the way PlcReader's read() runs, would it
    >     >         keep returning newer data?
    >     >
    >     >
    >     >
    >     >         We can skip over the rest of the parts, but looking at
    >     >         parts a and b above, would it make sense to have something
    >     >         like a kafka-connect framework for pushing data to Kafka?
    >     >         Also, from the github link, the drivers are to be supported
    >     >         in 3 languages as well. How would that play out?
    >     >
    >     >
    >     >
    >     >         Again - apologies if the questions seem stupid.
    >     >
    >     >
    >     >
    >     >         Thanks!
    >     >
    >     >         Sagar.
    >     >
    >     >
    >     >
    >     >         On Wed, Aug 22, 2018 at 10:39 PM Christofer Dutz <
    >     > [email protected]>
    >     >
    >     >         wrote:
    >     >
    >     >
    >     >
    >     >         > Hi Sagar,
    >     >
    >     >         >
    >     >
    >     >         > great that you managed to have a look ... I'll try to
    > answer your
    >     >
    >     >         > questions.
    >     >
    >     >         > (I like to answer them postfix, as whenever emails are
    >     >         > answered sort of in-line, they are extremely hard to read
    >     >         > and follow on mobile email clients :-) )
    >     >
    >     >         >
    >     >
    >     >         > First of all I created the original plugin via the
    >     >         > archetype for kafka-connect plugins. The next thing I did
    >     >         > was to have a look at the code of the JDBC Kafka Connect
    >     >         > plugin (as you might have guessed), as I thought that it
    >     >         > would have a similar structure to ours. Unfortunately I
    >     >         > think the JDBC plugin is far more complex than the plc4x
    >     >         > connector will have to be. I sort of picked some of the
    >     >         > things I liked from the archetype and some I liked from
    >     >         > the jdbc ... if there was a third, even cooler option ...
    >     >         > I will definitely have missed that. So if you think there
    >     >         > is a thing worth changing ... you can change anything you
    >     >         > like.
    >     >
    >     >         >
    >     >
    >     >         > 1)
    >     >         > The code of the jdbc plugin showed such a while(true)
    >     >         > loop, however I think this was because the jdbc query
    >     >         > could return a lot of rows and hereby Kafka events. In
    >     >         > our case we have one request and get one response. The
    >     >         > code in my example directly calls "get()" on the request
    >     >         > and is hereby blocking. I don't know if this is good,
    >     >         > but from reading the jdbc example, this should be
    >     >         > blocking too ...
    >     >         > So the PlcReader's read() method returns a completable
    >     >         > future ... this could be completed asynchronously and the
    >     >         > callback could fire the Kafka events, but I didn't know
    >     >         > if this was ok with Kafka. If it is possible, please have
    >     >         > a look at this example code:
    >     >
    >     >         >
    >     >
    >     >         > https://github.com/apache/incubator-plc4x/blob/master/plc4j/protocols/s7/src/test/java/org/apache/plc4x/java/s7/S7PlcReaderSample.java
    >     >
    >     >         > It demonstrates with comments the different usage types.
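Independent of the PLC4X specifics, the blocking vs. asynchronous styles discussed here follow the standard CompletableFuture pattern. A minimal sketch, with `simulateRead()` standing in for the actual driver call (not real PLC4X code):

```java
import java.util.concurrent.CompletableFuture;

// Sketch of the two usage styles for a read() that returns a
// CompletableFuture, as PlcReader's read() does.
public class ReadStyles {

    // Stand-in for the driver call; completes on another thread.
    static CompletableFuture<Integer> simulateRead() {
        return CompletableFuture.supplyAsync(() -> 42);
    }

    public static void main(String[] args) {
        // Blocking style, as in the current connector example:
        int blocking = simulateRead().join();
        System.out.println("blocking result: " + blocking);

        // Asynchronous style: a callback could hand the value on
        // (e.g. fire the Kafka events) instead of blocking.
        simulateRead()
            .thenAccept(v -> System.out.println("async result: " + v))
            .join(); // join here only so the demo waits before exiting
    }
}
```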
    >     >
    >     >         >
    >     >
    >     >         > While at it ... is there also an option for a Kafka
    >     >         > connector that is able to push data? So if an incoming
    >     >         > event arrives, this is automatically pushed without a
    >     >         > fixed polling interval?
    >     >
    >     >         >
    >     >
    >     >         > 2)
    >     >         > I have absolutely no idea, as I am not quite familiar
    >     >         > with the concepts inside Kafka. What I do know is that
    >     >         > probably the partition-key should be based upon the
    >     >         > connection url. The problem is that with Kafka I could
    >     >         > have 1000 nodes connecting to one PLC. While Kafka
    >     >         > wouldn't have problems with that, the PLCs have very
    >     >         > limited resources. As far as I decoded the responses of
    >     >         > my Siemens S7 1200, it can handle up to 20 connections
    >     >         > (usually a control-system is already consuming 2-3 of
    >     >         > them) ... I think it would be ideal if on one Kafka node
    >     >         > (or partition) there would be one PlcConnection ... this
    >     >         > connection should then be shared among all requests to a
    >     >         > PLC with a shared connection url (I hope I'm not writing
    >     >         > nonsense). So if a workerTask is responsible for managing
    >     >         > all requests to one partition, then I'd say it should be
    >     >         > 1 ... otherwise the number could be bigger.
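If the connection url is indeed used as the source partition, the bookkeeping maps a source task would hand to Kafka Connect could be as simple as the sketch below. The key names "url" and "ts" are assumptions for illustration, not an agreed convention:

```java
import java.util.Collections;
import java.util.Map;

// Sketch: the PLC connection url as the Connect source partition,
// and a read timestamp as the source offset, as discussed above.
public class PartitionKeys {

    public static Map<String, String> sourcePartition(String connectionUrl) {
        return Collections.singletonMap("url", connectionUrl);
    }

    public static Map<String, Long> sourceOffset(long readTimestampMillis) {
        return Collections.singletonMap("ts", readTimestampMillis);
    }

    public static void main(String[] args) {
        // These two maps are what a SourceRecord would carry so Connect
        // can resume per-PLC after a restart.
        System.out.println(sourcePartition("s7://192.168.0.1/0/0"));
        System.out.println(sourceOffset(System.currentTimeMillis()));
    }
}
```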
    >     >
    >     >         >
    >     >
    >     >         > If it makes things easier, I'm absolutely fine with
    >     >         > using those ConnectorUtils.
    >     >         >
    >     >         > Regarding the connector offsets ... are you referring to
    >     >         > that counter Kafka uses to let the clients know the
    >     >         > sequence of events, and which they use to sort of say:
    >     >         > "Hi, I have number 237367 of topic 'ABC', please
    >     >         > continue" ... is that what you are referring to? If it
    >     >         > is, well ... I have to admit ... I don't know ... ok ...
    >     >         > if it isn't, then probably also ;-)
    >     >         > How do other plugins do this?
    >     >
    >     >         >
    >     >
    >     >         > 3)
    >     >         > Well I guess both options would be cool ... JSON is
    >     >         > definitely simpler, but for high volume transports the
    >     >         > binary counterparts are definitely worth consideration.
    >     >         > Currently PLC4X tries to deliver what you request, but
    >     >         > that's actually something we're currently discussing
    >     >         > refactoring. But for the moment - as shown in the example
    >     >         > code I referenced a few lines above - if you do a
    >     >         > TypedRequest and for example ask for an Integer, then
    >     >         > you will receive an array (probably of size 1) of
    >     >         > Integers.
    >     >
    >     >         >
    >     >
    >     >         > 4)
    >     >         > Well I agree ... at least I can't even say that I make a
    >     >         > secret about where I stole things from ;-)
    >     >
    >     >         >
    >     >
    >     >         > If I can be of any assistance ... just ask.
    >     >
    >     >         >
    >     >
    >     >         > Thanks for taking the time.
    >     >
    >     >         >
    >     >
    >     >         > Chris
    >     >
    >     >         >
    >     >
    >     >         >
    >     >
    >     >         >
    >     >
    >     >         > On 22.08.18, 17:55, "Sagar" <[email protected]> wrote:
    >     >
    >     >         >
    >     >
    >     >         >     Hi All,
    >     >
    >     >         >
    >     >
    >     >         >     I was going through the K-Connect stubs created by
    >     >         >     Chris in the kafka feature branch.
    >     >
    >     >         >
    >     >
    >     >         >     Some of the things I found are here (let me know if
    >     >         >     they are valid or not):
    >     >
    >     >         >
    >     >
    >     >         >     1)
    >     >
    >     >         >
    >     >
    >     >         >
    >     >
    >     >         >     https://github.com/apache/incubator-plc4x/blob/feature/apache-kafka/integrations/apache-kafka/src/main/java/org/apache/plc4x/kafka/source/Plc4xSourceTask.java#L98
    >     >
    >     >         >
    >     >
    >     >         >     Should this block of code be within an infinite loop
    >     >         >     like while(true)? I am not exactly sure of the
    >     >         >     semantics of the PlcReader, hence asking this
    >     >         >     question.
    >     >
    >     >         >
    >     >
    >     >         >     2) Another question is, what are the maxTasks that we
    >     > envision here?
    >     >
    >     >         >
    >     >
    >     >         >
    >     >
    >     >         >     https://github.com/apache/incubator-plc4x/blob/feature/apache-kafka/integrations/apache-kafka/src/main/java/org/apache/plc4x/kafka/Plc4xSourceConnector.java#L46
    >     >
    >     >         >
    >     >
    >     >         >     Also, as part of the documentation, there's a utility
    >     >         >     called ConnectorUtils which typically should be used
    >     >         >     to create the configs (not a hard and fast rule
    >     >         >     though):
    >     >
    >     >         >
    >     >
    >     >         >
    >     >
    >     >         >
    >     >
    >     >         >     https://docs.confluent.io/current/connect/javadocs/index.html?org/apache/kafka/connect/util/ConnectorUtils.html
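For readers unfamiliar with it, `ConnectorUtils.groupPartitions(elements, maxGroups)` essentially splits a list of "partitions" (here they would be PLC connection urls) into task groups. A plain-Java approximation to show the idea (this simplified version distributes round-robin; the exact distribution of the real utility may differ):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for ConnectorUtils.groupPartitions: split the
// elements into at most maxGroups groups of near-equal size, one
// group per connector task.
public class GroupPartitions {

    public static List<List<String>> group(List<String> elements, int maxGroups) {
        int numGroups = Math.min(elements.size(), maxGroups);
        List<List<String>> result = new ArrayList<>(numGroups);
        for (int i = 0; i < numGroups; i++) {
            result.add(new ArrayList<>());
        }
        for (int i = 0; i < elements.size(); i++) {
            result.get(i % numGroups).add(elements.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("s7://plc1", "s7://plc2", "s7://plc3");
        // With tasks.max = 2: two groups, e.g. [plc1, plc3] and [plc2].
        System.out.println(group(urls, 2));
    }
}
```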
    >     >
    >     >         >
    >     >
    >     >         >     If we go that route, then we also need to specify how
    >     >         >     the offsets would be stored in the offsets topic (by
    >     >         >     using the task name). So, if it can be figured out
    >     >         >     how the connectors would be set up, then that'll be
    >     >         >     helpful.
    >     >
    >     >         >
    >     >
    >     >         >     3) While building the SourceRecord ->
    >     >
    >     >         >
    >     >
    >     >         >
    >     >
    >     >         >
    >     >
    >     >         >     https://github.com/apache/incubator-plc4x/blob/feature/apache-kafka/integrations/apache-kafka/src/main/java/org/apache/plc4x/kafka/source/Plc4xSourceTask.java#L109
    >     >
    >     >         >
    >     >
    >     >         >     , we would also need some DataConverter layer to have
    >     >         >     them mapped to the Connect types. Also, which message
    >     >         >     formats would be supported? JSON, or binary protocols
    >     >         >     like Avro/protobuf etc., or some other protocols?
    >     >         >     Those things might also need to be factored in.
    >     >
    >     >         >
    >     >
    >     >         >     4) Lastly, we need to remove the JdbcSourceTask from
    >     >         >     the catch block here :) ->
    >     >
    >     >         >
    >     >
    >     >         >
    >     >
    >     >         >
    >     >
    >     >         >     https://github.com/apache/incubator-plc4x/blob/feature/apache-kafka/integrations/apache-kafka/src/main/java/org/apache/plc4x/kafka/source/Plc4xSourceTask.java#L67
    >     >
    >     >         >
    >     >
    >     >         >     Thanks!
    >     >
    >     >         >     Sagar.
    >     >
    >     >         >
    >     >
    >     >         >
    >     >
    >     >         >
    >     >
    >     >
    >     >
    >     >
    >     >
    >     >
    >     >
    >     >
    >     >
    >
    >
    >
    
