Re: Source Connector Task in a distributed env

2019-04-24 Thread Venkata S A
Thank you Ryann & Hans. I will look into it.
The spooldir, I explored it too and found that it too suits for standalone
as you mentioned.

'Venkata

On Wed 24 Apr, 2019, 22:34 Hans Jespersen,  wrote:

> Your connector sounds a lot like this one
> https://github.com/jcustenborder/kafka-connect-spooldir
>
> I do not think you can run such a connector in distributed mode though.
> Typically something like this runs in standalone mode to avoid conflicts.
>
> -hans
>
>
> On Wed, Apr 24, 2019 at 1:08 AM Venkata S A  wrote:
>
> > Hello Team,
> >
> >   I am developing a custom Source Connector that watches a
> > given directory for any new files. My question is in a Distributed
> > environment, how will the tasks in different nodes handle the file Queue?
> >
> >   Referring to this sample
> > <
> >
> https://github.com/DataReply/kafka-connect-directory-source/tree/master/src/main/java/org/apache/kafka/connect/directory
> > >
> > ,
> > poll() in SourceTask is polling the directory at specified interval for a
> > new files and fetching the files in a Queue as below:
> >
> > Queue queue = ((DirWatcher) task).getFilesQueue();
> > >
> >
> > So, When in a 3 node cluster, this is run individually by each task. But
> > then, How is the synchronization happening between all the tasks in
> > different nodes to avoid duplication of file reading to kafka ?
> >
> >
> > Thank you,
> > Venkata S
> >
>


Re: Source Connector Task in a distributed env

2019-04-24 Thread Hans Jespersen
Your connector sounds a lot like this one
https://github.com/jcustenborder/kafka-connect-spooldir

I do not think you can run such a connector in distributed mode though.
Typically something like this runs in standalone mode to avoid conflicts.

-hans


On Wed, Apr 24, 2019 at 1:08 AM Venkata S A  wrote:

> Hello Team,
>
>   I am developing a custom Source Connector that watches a
> given directory for any new files. My question is in a Distributed
> environment, how will the tasks in different nodes handle the file Queue?
>
>   Referring to this sample
> <
> https://github.com/DataReply/kafka-connect-directory-source/tree/master/src/main/java/org/apache/kafka/connect/directory
> >
> ,
> poll() in SourceTask is polling the directory at specified interval for a
> new files and fetching the files in a Queue as below:
>
> Queue queue = ((DirWatcher) task).getFilesQueue();
> >
>
> So, When in a 3 node cluster, this is run individually by each task. But
> then, How is the synchronization happening between all the tasks in
> different nodes to avoid duplication of file reading to kafka ?
>
>
> Thank you,
> Venkata S
>


Re: Source Connector Task in a distributed env

2019-04-24 Thread Ryanne Dolan
Venkata, the example you have linked creates a single task config s.t.
there is no parallelism -- a single task runs on the cluster, regardless of
the number of nodes. In order to introduce parallelism, your
SourceConnector needs to group all known files among N partitions and
return N task configs for N tasks. You can use
ConnectorUtils.groupPartitions() for this. In each task config, specify the
specific group of files for that task, as grouped by groupPartitions().

Then your SourceConnector can watch for new files. Anytime a new file is
detected, call context.requestTaskReconfiguration(), which will restart
this process.

Ryanne

On Wed, Apr 24, 2019 at 3:08 AM Venkata S A  wrote:

> Hello Team,
>
>   I am developing a custom Source Connector that watches a
> given directory for any new files. My question is in a Distributed
> environment, how will the tasks in different nodes handle the file Queue?
>
>   Referring to this sample
> <
> https://github.com/DataReply/kafka-connect-directory-source/tree/master/src/main/java/org/apache/kafka/connect/directory
> >
> ,
> poll() in SourceTask is polling the directory at specified interval for a
> new files and fetching the files in a Queue as below:
>
> Queue queue = ((DirWatcher) task).getFilesQueue();
> >
>
> So, When in a 3 node cluster, this is run individually by each task. But
> then, How is the synchronization happening between all the tasks in
> different nodes to avoid duplication of file reading to kafka ?
>
>
> Thank you,
> Venkata S
>


Source Connector Task in a distributed env

2019-04-24 Thread Venkata S A
Hello Team,

  I am developing a custom Source Connector that watches a
given directory for any new files. My question is in a Distributed
environment, how will the tasks in different nodes handle the file Queue?

  Referring to this sample

,
poll() in SourceTask is polling the directory at specified interval for a
new files and fetching the files in a Queue as below:

Queue queue = ((DirWatcher) task).getFilesQueue();
>

So, When in a 3 node cluster, this is run individually by each task. But
then, How is the synchronization happening between all the tasks in
different nodes to avoid duplication of file reading to kafka ?


Thank you,
Venkata S