Re: Scaling source processors in nifi horizontally.
I initially thought you were saying that you had 250 Avro schemas to use, as in 250 distinct data models. Maybe someone else has a suggestion on how to do it, but I think this may just be a fundamental problem of having that many different databases in MySQL and trying to do CDC with them. Is there a hard business requirement to segregate the data like that, or is some other factor at play here, such as pulling from many remote databases?

On Thu, Oct 18, 2018 at 6:19 AM ashwin konale wrote:
> Hi,
>
> The flow is like this:
>
> MysqlCDC -> UpdateAttributes -> MergeContent -> (PutHDFS, PutGCS)
>
> But we have around 250 schemas to pull data from, so with a clustered setup:
>
> MysqlCDC_schema1 -> RPG
> MysqlCDC_schema2 -> RPG
> MysqlCDC_schema3 -> RPG and so on
>
> InputPort -> UpdateAttributes -> MergeContent -> (PutHDFS, PutGCS)
>
> But MysqlCDC can run only on the primary node in the cluster, so I will end up
> running all of the input processors on a single node. This can easily become a
> bottleneck as the number of schemas grows. Could you suggest an
> alternative approach to this problem?
Re: Scaling source processors in nifi horizontally.
Hi,

The flow is like this:

MysqlCDC -> UpdateAttributes -> MergeContent -> (PutHDFS, PutGCS)

But we have around 250 schemas to pull data from, so with a clustered setup:

MysqlCDC_schema1 -> RPG
MysqlCDC_schema2 -> RPG
MysqlCDC_schema3 -> RPG and so on

InputPort -> UpdateAttributes -> MergeContent -> (PutHDFS, PutGCS)

But MysqlCDC can run only on the primary node in the cluster, so I will end up running all of the input processors on a single node. This can easily become a bottleneck as the number of schemas grows. Could you suggest an alternative approach to this problem?

On 2018/10/17 21:14:09, Mike Thomsen wrote:
> > may have to build some kind of tooling on top of it to monitor/provision
> > new processor for newly added schemas etc.
>
> Could you elaborate on this part of your use case?
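One idea that does not come up in the thread: if several schemas live on the same MySQL server, a single CDC processor could cover a group of them instead of one processor per schema. The sketch below only builds the regular expressions for such groups. It assumes the processor is NiFi's CaptureChangeMySQL and that its "Database/Schema Name Pattern" property accepts a regex; both should be checked against the documentation for your NiFi version. The helper names and the group size are hypothetical. This reduces the number of processors to manage, but it does not remove the primary-node constraint.

```python
import re


def chunk(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def schema_pattern(schemas):
    """Build an anchored regex that matches exactly the given schema names."""
    alternatives = "|".join(re.escape(name) for name in sorted(schemas))
    return "^(" + alternatives + ")$"


if __name__ == "__main__":
    # Hypothetical schema names; in practice these would come from MySQL.
    all_schemas = [f"schema{i}" for i in range(1, 251)]
    # Hypothetical choice: 25 schemas per processor, so 10 processors.
    for number, group in enumerate(chunk(all_schemas, 25), start=1):
        pattern = schema_pattern(group)
        print(f"processor {number}: {len(group)} schemas, "
              f"pattern length {len(pattern)}")
```

The patterns are anchored so that `schema1` does not also match `schema10`.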
Re: Scaling source processors in nifi horizontally.
> may have to build some kind of tooling on top of it to monitor/provision
> new processor for newly added schemas etc.

Could you elaborate on this part of your use case?

On Wed, Oct 17, 2018 at 2:31 PM ashwin konale wrote:
> Hi,
>
> I am experimenting with NiFi for one of our use cases, with plans to
> extend it to various other data routing and ingestion use cases. Right now I
> need to ingest data from MySQL binlogs into HDFS/GCS. We have around 250
> different schemas and about 3000 tables to read data from. The volume of the
> data flow ranges from 500 to 2000 messages per second in different schemas.
>
> Right now the problem is that the MysqlCDC processor can run in only one
> thread. To overcome this issue I have two options.
>
> 1. Use primary node execution, with a separate processor for each
> schema. All processors that read from MySQL would then run on a
> single node, which will be a bottleneck no matter how big my NiFi cluster
> is.
>
> 2. Use multiple NiFi instances to pull data, and have a master NiFi
> cluster for ingestion into the various sinks. With this approach I will
> have to manage all these small NiFi instances, and may have to build some
> kind of tooling on top of it to monitor them and provision new processors
> for newly added schemas, etc.
>
> Is there any better way to achieve my use case with NiFi? Please advise me
> on the architecture.
>
> Looking forward to your suggestions.
>
> - Ashwin
Scaling source processors in nifi horizontally.
Hi,

I am experimenting with NiFi for one of our use cases, with plans to extend it to various other data routing and ingestion use cases. Right now I need to ingest data from MySQL binlogs into HDFS/GCS. We have around 250 different schemas and about 3000 tables to read data from. The volume of the data flow ranges from 500 to 2000 messages per second in different schemas.

Right now the problem is that the MysqlCDC processor can run in only one thread. To overcome this issue I have two options.

1. Use primary node execution, with a separate processor for each schema. All processors that read from MySQL would then run on a single node, which will be a bottleneck no matter how big my NiFi cluster is.

2. Use multiple NiFi instances to pull data, and have a master NiFi cluster for ingestion into the various sinks. With this approach I will have to manage all these small NiFi instances, and may have to build some kind of tooling on top of it to monitor them and provision new processors for newly added schemas, etc.

Is there any better way to achieve my use case with NiFi? Please advise me on the architecture.

Looking forward to your suggestions.

- Ashwin
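To make the provisioning tooling in option 2 a little more concrete, here is a minimal sketch of one piece of it: deciding which small NiFi instance should own which schema. It assumes a fixed number of puller instances and uses a stable hash of the schema name. The function names, the instance naming, and the count of five instances are all hypothetical and not part of NiFi.

```python
import hashlib
from collections import defaultdict


def assign_instance(schema, instance_count):
    """Map a schema name to an instance index using a stable hash.

    hashlib is used instead of the built-in hash() so the result is the
    same across Python processes and restarts.
    """
    digest = hashlib.sha256(schema.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % instance_count


def build_plan(schemas, instance_count):
    """Group schema names by the instance index that should read them."""
    plan = defaultdict(list)
    for schema in sorted(schemas):
        plan[assign_instance(schema, instance_count)].append(schema)
    return dict(plan)


if __name__ == "__main__":
    # Hypothetical schema names; in practice these would come from MySQL.
    schemas = [f"schema{i}" for i in range(1, 251)]
    plan = build_plan(schemas, instance_count=5)
    for index in sorted(plan):
        print(f"nifi-puller-{index}: {len(plan[index])} schemas")
```

With this scheme a newly added schema lands on a predictable instance without touching the others. Changing the number of instances reshuffles most assignments, though, so a consistent-hashing scheme or an explicit lookup table may be a better fit if the instance count is expected to change.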