Re: [discuss] NiFi support for Hadoop ecosystem components

2023-04-12 Thread Jim Halfpenny
Hi all,
Late to the party, but as a vendor I favour option 2 rather than abandoning support 
for the Hadoop NARs. There is a very long tail of folks using NiFi on Hadoop 
and I’d sooner give them the opportunity to keep using and upgrading NiFi while 
they decide what to do with their existing data platforms.

Keeping the convenience build to a minimal size is a worthy goal and providing 
a separate binary repo for NAR files would be ideal. For users creating Docker 
images for NiFi it would be easy to incorporate the additional plugins that no 
longer appear in the convenience build.
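
As a rough illustration of that last point, here is a minimal Python sketch of a build-time step that copies separately downloaded NAR files into the directory an image loads extensions from. The function name `stage_nars` and both directory arguments are hypothetical; the real locations depend on how the image is laid out and how NiFi is configured.

```python
import shutil
from pathlib import Path


def stage_nars(source_dir: str, extensions_dir: str) -> list[str]:
    """Copy every *.nar file from source_dir into extensions_dir.

    Both paths are assumptions supplied by the caller; nothing here is
    specific to a particular NiFi release or Docker image layout.
    Returns the sorted names of the files that were copied.
    """
    src = Path(source_dir)
    dest = Path(extensions_dir)
    # Create the target directory if the image does not have it yet.
    dest.mkdir(parents=True, exist_ok=True)

    copied = []
    for nar in sorted(src.glob("*.nar")):
        # copy2 keeps file metadata such as modification time.
        shutil.copy2(nar, dest / nar.name)
        copied.append(nar.name)
    return copied
```

A call such as `stage_nars("/tmp/hadoop-nars", "/opt/nifi/extensions")` (both paths made up for the example) would leave non-NAR files in the source directory untouched.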

Kind regards,
Jim


> On 24 Mar 2023, at 18:07, Joe Witt  wrote:
> 
> Team,
> 
> For the entire time NiFi has been in Apache we've built with support for
> various Hadoop ecosystem components like HDFS, Hive, HBase, others,
> and more recently formats/serialization modes like those necessary for
> Parquet, ORC, Iceberg, etc.
> 
> All of these things however present endless challenges with
> compatibility across different versions (Hive being the most difficult
> by far), vendors (hadoop vendors, cloud vendors, etc..).  And also
> super notably the incredible number of dependencies, dependency
> conflicts, inclusions/exclusions, old log libs, vulnerability updates,
> etc..  And last but certainly not least a big reason why our build has
> grown so much.
> 
> We have a couple options:
> 1. Deprecate these components in NiFi 1.x and drop them entirely in
> NiFi 2.x.  Leave this as a problem for vendors to deal with.  NiFi
> users interacting with such components are nearly exclusively doing so
> with vendors anyway.
> 
> 2. Remove the components from NiFi main code line and create a
> separate repo for 'nifi-hadoop-extensions'.  We manage those
> independently and release them periodically.  They would be available
> for people to grab the nars if they want to use them.  We include none
> of them in the convenience binary going forward by default.
> 
> 3. Change nothing.  Continue to battle with the above listed items.
> This is admittedly a bit of a non-option.  We can't keep spending the
> same time/energy on these we have.  It is a very small number of
> people that fight this battle.
> 
> Look forward to hearing thoughts on these options or others we might consider.
> 
> Thanks



Re: [discuss] NiFi support for Hadoop ecosystem components

2023-03-29 Thread Joe Witt
Given the discussion, my sense is that a clean option 1 is not really
realistic.  We're more likely looking at option 2, with a far less
frequent release cycle for these components and thus lower
overhead.  My biggest gripe is the constant workarounds we have in all
these poms to wrangle dependencies and includes/excludes, simply because
those projects do not maintain them with nearly the speed/focus that
we do.

Thanks

On Wed, Mar 29, 2023 at 1:07 PM Steven Matison  wrote:
>
> I waited to respond because I wanted to see the conversation before jumping
> in.
>
> As a user, fan, and consultant I am super excited to even know context
> around nifi 2.0 and entire modernization effort that is already on going to
> end of 1.x. I am also thankful to be working on some of it myself too.
> This is an amazing time for nifi.
>
> In respect of anyone using "old nifi" this topic is super concerning, but
> to me this is not more concerning than why you are still using older
> versions and/or why are you needing deep backward compatibility with new
> versions that are several years from where you are.  Are you not also
> modernizing?
>
> Those things aside, my choice is #1.  I speak for those using current
> version all way back to some of oldest versions you could imagine are still
> online. These people will continue to operate nifi the way they need it
> in those versions.  If and when they decide to take a new version, they
> will, and they will also deal with the inherent challenges to modernize.
> Many of them work completely outside of community and will do that
> themselves or rely on a vendor who can support that path for them.
>
>
>
>
>
> On Wed, Mar 29, 2023 at 11:03 AM Chakravarty, G  wrote:
>
> > One of our primary reasons for using Nifi is that it plays nicely with
> > connecting with on-prem HDFS/Hive/Kudu data stores. Also, it appears that
> > although the on-prem hadoop/hive tech stack is somewhat less popular now,
> > the same hdfs/hive technology is appearing in the cloud under different
> > names: Google Dataproc, AWS EMR, Azure HDinsight, Iceberg etc. Some type of
> > generic components where the Hadoop processors connectivity to Nifi is
> > maintained while individual vendors maintain their own connectivity to
> > their products will be a good option if possible.
> >
> > GC
> >
> > ________
> > From: Isha Lamboo 
> > Sent: Monday, March 27, 2023 9:04 AM
> > To: dev@nifi.apache.org 
> > Subject: RE: [discuss] NiFi support for Hadoop ecosystem components
> >
> > From the perspective of a NiFi administrator:
> >
> > Removing the xxxHDFS processors anytime soon (2.0) would be a huge issue
> > for us. It shouldn't be, the last Hadoop cluster in our environments was
> > shut down earlier this year. Hive was already gone more than a year ago.
> > But we still have 1000+ HDFS processors in use to manage the Azure
> > Datalake. Azure-specific processors have been available for a while, but
> > there was no business case to migrating solutions that were working fine.
> >
> > Getting the required development time/budget to migrate all those flows to
> > the Azure processors doesn't look very realistic. This would have to be a
> > gradual "replace when you need to change and test the flow anyway" affair.
> > Until that finishes, we'd be stuck on the 1.x branch since we're not using
> > vendor support.
> >
> > Option #2 would be vastly preferable to #1 for this simple and dumb reason.
> >
> > Disregarding our technical debt issues, I agree that it makes sense for
> > NiFi instances with a lot of Hadoop integration to depend on vendors for
> > their specific flavor of Hadoop, while core NiFi moves forward without all
> > of that complexity.
> >
> > Regards,
> >
> > Isha
> >
> > -----Original Message-----
> > From: Nandor Soma Abonyi
> > Sent: Monday, 27 March 2023 12:31
> > To: dev@nifi.apache.org
> > Subject: Re: [discuss] NiFi support for Hadoop ecosystem components
> >
> > Thank you for raising this topic, Joe!
> >
> > While I understand the desire to remove Hadoop components, I have mixed
> > feelings about removing one of the core parts of the Big Data world from
> > the project. I'm unsure for how many users we could make a hard time
> > removing those components. It seems to be a too significant shift in our
> > philosophy.
> > We can already see in the above example that somebody would not use NiFi
> > if we'd removed them.
> >
> > Furthermore, although Hadoop has been buried multiple times, new
> 

Re: [discuss] NiFi support for Hadoop ecosystem components

2023-03-29 Thread Steven Matison
I waited to respond because I wanted to see the conversation before jumping
in.

As a user, fan, and consultant I am super excited to even know the context
around NiFi 2.0 and the entire modernization effort that is already ongoing
toward the end of 1.x. I am also thankful to be working on some of it myself.
This is an amazing time for NiFi.

With respect to anyone using "old NiFi", this topic is super concerning, but
to me it is no more concerning than why you are still using older
versions and/or why you need deep backward compatibility with new
versions that are several years ahead of where you are.  Are you not also
modernizing?

Those things aside, my choice is #1.  I speak for those using everything
from the current version all the way back to some of the oldest versions you
could imagine are still online. These people will continue to operate NiFi
the way they need it in those versions.  If and when they decide to take a
new version, they will, and they will also deal with the inherent challenges
of modernizing.  Many of them work completely outside of the community and
will do that themselves or rely on a vendor who can support that path for them.





On Wed, Mar 29, 2023 at 11:03 AM Chakravarty, G  wrote:

> One of our primary reasons for using Nifi is that it plays nicely with
> connecting with on-prem HDFS/Hive/Kudu data stores. Also, it appears that
> although the on-prem hadoop/hive tech stack is somewhat less popular now,
> the same hdfs/hive technology is appearing in the cloud under different
> names: Google Dataproc, AWS EMR, Azure HDinsight, Iceberg etc. Some type of
> generic components where the Hadoop processors connectivity to Nifi is
> maintained while individual vendors maintain their own connectivity to
> their products will be a good option if possible.
>
> GC
>
> 
> From: Isha Lamboo 
> Sent: Monday, March 27, 2023 9:04 AM
> To: dev@nifi.apache.org 
> Subject: RE: [discuss] NiFi support for Hadoop ecosystem components
>
> From the perspective of a NiFi administrator:
>
> Removing the xxxHDFS processors anytime soon (2.0) would be a huge issue
> for us. It shouldn't be, the last Hadoop cluster in our environments was
> shut down earlier this year. Hive was already gone more than a year ago.
> But we still have 1000+ HDFS processors in use to manage the Azure
> Datalake. Azure-specific processors have been available for a while, but
> there was no business case to migrating solutions that were working fine.
>
> Getting the required development time/budget to migrate all those flows to
> the Azure processors doesn't look very realistic. This would have to be a
> gradual "replace when you need to change and test the flow anyway" affair.
> Until that finishes, we'd be stuck on the 1.x branch since we're not using
> vendor support.
>
> Option #2 would be vastly preferable to #1 for this simple and dumb reason.
>
> Disregarding our technical debt issues, I agree that it makes sense for
> NiFi instances with a lot of Hadoop integration to depend on vendors for
> their specific flavor of Hadoop, while core NiFi moves forward without all
> of that complexity.
>
> Regards,
>
> Isha
>
> -----Original Message-----
> From: Nandor Soma Abonyi
> Sent: Monday, 27 March 2023 12:31
> To: dev@nifi.apache.org
> Subject: Re: [discuss] NiFi support for Hadoop ecosystem components
>
> Thank you for raising this topic, Joe!
>
> While I understand the desire to remove Hadoop components, I have mixed
> feelings about removing one of the core parts of the Big Data world from
> the project. I'm unsure for how many users we could make a hard time
> removing those components. It seems to be a too significant shift in our
> philosophy.
> We can already see in the above example that somebody would not use NiFi
> if we'd removed them.
>
> Furthermore, although Hadoop has been buried multiple times, new
> technologies seem to still depend on it. For example, Iceberg, in which
> case I'm worried about the consequences of removing the support for an
> increasingly popular technology.
>
> So I wonder whether it is possible to find a forward-looking solution that
> could serve all projects. I've always found configuring Hadoop and friends
> too tricky and I thought it was primarily for historical reasons. The
> issues you describe could easily result from such a thing. I assume that
> over time, new and new things have been added on top of the existing
> implementation without significant refactoring.
>
> My - probably utopistic - idea would be to contact the Hadoop and Hive
> teams and share the issues we are dealing with. Probably we are not alone
> in these problems, but I don't know whether they are aware of them. Even if
> they are, I think approaching them is w

Re: [discuss] NiFi support for Hadoop ecosystem components

2023-03-29 Thread Chakravarty, G
One of our primary reasons for using NiFi is that it connects nicely with 
on-prem HDFS/Hive/Kudu data stores. Also, it appears that although the 
on-prem Hadoop/Hive tech stack is somewhat less popular now, the same 
HDFS/Hive technology is appearing in the cloud under different names: 
Google Dataproc, AWS EMR, Azure HDInsight, Iceberg, etc. Some set of generic 
components that maintains the Hadoop processors' connectivity to NiFi, while 
individual vendors maintain their own connectivity to their products, would 
be a good option if possible.

GC


From: Isha Lamboo 
Sent: Monday, March 27, 2023 9:04 AM
To: dev@nifi.apache.org 
Subject: RE: [discuss] NiFi support for Hadoop ecosystem components

From the perspective of a NiFi administrator:

Removing the xxxHDFS processors anytime soon (2.0) would be a huge issue for 
us. It shouldn't be, the last Hadoop cluster in our environments was shut down 
earlier this year. Hive was already gone more than a year ago. But we still 
have 1000+ HDFS processors in use to manage the Azure Datalake. Azure-specific 
processors have been available for a while, but there was no business case to 
migrating solutions that were working fine.

Getting the required development time/budget to migrate all those flows to the 
Azure processors doesn't look very realistic. This would have to be a gradual 
"replace when you need to change and test the flow anyway" affair. Until that 
finishes, we'd be stuck on the 1.x branch since we're not using vendor support.

Option #2 would be vastly preferable to #1 for this simple and dumb reason.

Disregarding our technical debt issues, I agree that it makes sense for NiFi 
instances with a lot of Hadoop integration to depend on vendors for their 
specific flavor of Hadoop, while core NiFi moves forward without all of that 
complexity.

Regards,

Isha

-----Original Message-----
From: Nandor Soma Abonyi
Sent: Monday, 27 March 2023 12:31
To: dev@nifi.apache.org
Subject: Re: [discuss] NiFi support for Hadoop ecosystem components

Thank you for raising this topic, Joe!

While I understand the desire to remove Hadoop components, I have mixed 
feelings about removing one of the core parts of the Big Data world from the 
project. I'm unsure for how many users we could make a hard time removing those 
components. It seems to be a too significant shift in our philosophy.
We can already see in the above example that somebody would not use NiFi if 
we'd removed them.

Furthermore, although Hadoop has been buried multiple times, new technologies 
seem to still depend on it. For example, Iceberg, in which case I'm worried 
about the consequences of removing the support for an increasingly popular 
technology.

So I wonder whether it is possible to find a forward-looking solution that 
could serve all projects. I've always found configuring Hadoop and friends too 
tricky and I thought it was primarily for historical reasons. The issues you 
describe could easily result from such a thing. I assume that over time, new 
and new things have been added on top of the existing implementation without 
significant refactoring.

My - probably utopistic - idea would be to contact the Hadoop and Hive teams 
and share the issues we are dealing with. Probably we are not alone in these 
problems, but I don't know whether they are aware of them. Even if they are, I 
think approaching them is worth the chance. Who knows where we will end up if 
somebody representing the NiFi project does that?

Regards,
Nandor Soma Abonyi


> On Mar 24, 2023, at 10:40 PM, Jeremy Dyer  wrote:
>
> I think option 2 is the best way to handle this.
>
> Technology naturally changes over time and some components of Nifi might not 
> make the most sense to keep around in the main line for the masses anymore. 
> However I really like still having them there for people to very simply add 
> if they so desire too. I see other platforms do this by adding a “contrib” 
> repo. What if we had something like a “nifi-contrib” or “nifi-emeritus” repo 
> in GitHub, Apache GitHub repo, where the community can still be involved as 
> desired but also keep things readily available to those who might not even be 
> heavily involved in the community?
>
> I even see this as a sustainable pattern for any components that need “moved 
> out”.
>
> I wouldn’t even think those components in the “contrib” repo would require 
> voting on for releases but someone, or a vendor, could update them via PRs 
> after the official release.
>
> Jeremy Dyer
>
> Get Outlook for iOS<https://aka.ms/o0ukef>

RE: [discuss] NiFi support for Hadoop ecosystem components

2023-03-27 Thread Isha Lamboo
From the perspective of a NiFi administrator:

Removing the xxxHDFS processors anytime soon (2.0) would be a huge issue for 
us. It shouldn't be: the last Hadoop cluster in our environments was shut down 
earlier this year, and Hive was already gone more than a year ago. But we still 
have 1000+ HDFS processors in use to manage the Azure Data Lake. Azure-specific 
processors have been available for a while, but there was no business case for 
migrating solutions that were working fine.

Getting the required development time/budget to migrate all those flows to the 
Azure processors doesn't look very realistic. This would have to be a gradual 
"replace when you need to change and test the flow anyway" affair. Until that 
finishes, we'd be stuck on the 1.x branch since we're not using vendor support.

Option #2 would be vastly preferable to #1 for this simple and dumb reason. 

Disregarding our technical debt issues, I agree that it makes sense for NiFi 
instances with a lot of Hadoop integration to depend on vendors for their 
specific flavor of Hadoop, while core NiFi moves forward without all of that 
complexity. 

Regards,

Isha

-----Original Message-----
From: Nandor Soma Abonyi
Sent: Monday, 27 March 2023 12:31
To: dev@nifi.apache.org
Subject: Re: [discuss] NiFi support for Hadoop ecosystem components

Thank you for raising this topic, Joe!

While I understand the desire to remove Hadoop components, I have mixed 
feelings about removing one of the core parts of the Big Data world from the 
project. I'm unsure for how many users we could make a hard time removing those 
components. It seems to be a too significant shift in our philosophy.
We can already see in the above example that somebody would not use NiFi if 
we'd removed them.

Furthermore, although Hadoop has been buried multiple times, new technologies 
seem to still depend on it. For example, Iceberg, in which case I'm worried 
about the consequences of removing the support for an increasingly popular 
technology.

So I wonder whether it is possible to find a forward-looking solution that 
could serve all projects. I've always found configuring Hadoop and friends too 
tricky and I thought it was primarily for historical reasons. The issues you 
describe could easily result from such a thing. I assume that over time, new 
and new things have been added on top of the existing implementation without 
significant refactoring.

My - probably utopistic - idea would be to contact the Hadoop and Hive teams 
and share the issues we are dealing with. Probably we are not alone in these 
problems, but I don't know whether they are aware of them. Even if they are, I 
think approaching them is worth the chance. Who knows where we will end up if 
somebody representing the NiFi project does that?

Regards,
Nandor Soma Abonyi


> On Mar 24, 2023, at 10:40 PM, Jeremy Dyer  wrote:
> 
> I think option 2 is the best way to handle this.
> 
> Technology naturally changes over time and some components of Nifi might not 
> make the most sense to keep around in the main line for the masses anymore. 
> However I really like still having them there for people to very simply add 
> if they so desire too. I see other platforms do this by adding a “contrib” 
> repo. What if we had something like a “nifi-contrib” or “nifi-emeritus” repo 
> in GitHub, Apache GitHub repo, where the community can still be involved as 
> desired but also keep things readily available to those who might not even be 
> heavily involved in the community?
> 
> I even see this as a sustainable pattern for any components that need “moved 
> out”.
> 
> I wouldn’t even think those components in the “contrib” repo would require 
> voting on for releases but someone, or a vendor, could update them via PRs 
> after the official release.
> 
> Jeremy Dyer
> 
> Get Outlook for iOS<https://aka.ms/o0ukef>
> 
> From: Chakravarty, G 
> Sent: Friday, March 24, 2023 4:36:43 PM
> To: dev@nifi.apache.org 
> Subject: Re: [discuss] NiFi support for Hadoop ecosystem components
> 
> I am wondering if the standard Nifi jdbc/odbc processors with some basic 
> testing with the common drivers like Simba etc. Hive drivers can help to 
> alleviate the issue without having separate HiveQL processors.
> 
> GC
> ____________
> From: Bryan Bende 
> Sent: Friday, March 24, 2023 4:05 PM
> To: dev@nifi.apache.org 
> Subject: Re: [discuss] NiFi support

Re: [discuss] NiFi support for Hadoop ecosystem components

2023-03-27 Thread Nandor Soma Abonyi
Thank you for raising this topic, Joe!

While I understand the desire to remove Hadoop components, I have mixed
feelings about removing one of the core parts of the Big Data world from the
project. I'm unsure how many users we would give a hard time by removing
those components. It seems too significant a shift in our philosophy.
We can already see in the above example that somebody would not use
NiFi if we removed them.

Furthermore, although Hadoop has been buried multiple times, new technologies
still seem to depend on it, Iceberg for example, so I'm worried about
the consequences of removing support for an increasingly popular technology.

So I wonder whether it is possible to find a forward-looking solution that could
serve all projects. I've always found configuring Hadoop and friends too tricky
and I thought it was primarily for historical reasons. The issues you describe
could easily result from such a thing. I assume that over time, more and more
things have been added on top of the existing implementation without significant
refactoring.

My - probably utopian - idea would be to contact the Hadoop and Hive teams
and share the issues we are dealing with. We are probably not alone in these
problems, but I don't know whether they are aware of them. Even if they are,
I think approaching them is worth the chance. Who knows where we will end up
if somebody representing the NiFi project does that?

Regards,
Nandor Soma Abonyi


> On Mar 24, 2023, at 10:40 PM, Jeremy Dyer  wrote:
> 
> I think option 2 is the best way to handle this.
> 
> Technology naturally changes over time and some components of Nifi might not 
> make the most sense to keep around in the main line for the masses anymore. 
> However I really like still having them there for people to very simply add 
> if they so desire too. I see other platforms do this by adding a “contrib” 
> repo. What if we had something like a “nifi-contrib” or “nifi-emeritus” repo 
> in GitHub, Apache GitHub repo, where the community can still be involved as 
> desired but also keep things readily available to those who might not even be 
> heavily involved in the community?
> 
> I even see this as a sustainable pattern for any components that need “moved 
> out”.
> 
> I wouldn’t even think those components in the “contrib” repo would require 
> voting on for releases but someone, or a vendor, could update them via PRs 
> after the official release.
> 
> Jeremy Dyer
> 
> Get Outlook for iOS<https://aka.ms/o0ukef>
> 
> From: Chakravarty, G 
> Sent: Friday, March 24, 2023 4:36:43 PM
> To: dev@nifi.apache.org 
> Subject: Re: [discuss] NiFi support for Hadoop ecosystem components
> 
> I am wondering if the standard Nifi jdbc/odbc processors with some basic 
> testing with the common drivers like Simba etc. Hive drivers can help to 
> alleviate the issue without having separate HiveQL processors.
> 
> GC
> ____
> From: Bryan Bende 
> Sent: Friday, March 24, 2023 4:05 PM
> To: dev@nifi.apache.org 
> Subject: Re: [discuss] NiFi support for Hadoop ecosystem components
> 
> I lean towards option 2 with the caveat that maybe we don't have to
> retain every Hadoop related component when creating this separate set
> of components. Mainly I'm thinking that Hive has been the most
> problematic to maintain so maybe that is dropped altogether. I think
> it would be unfortunate to not have publicly available HDFS
> processors.
> 
> On Fri, Mar 24, 2023 at 3:23 PM Matt Burgess  wrote:
>> 
>> As one of the small number of people that fight the battle, I like the
>> idea of Option 1 (full disclosure: I work for a vendor). From a
>> community standpoint (I'm on the PMC) I'm not strongly opposed to
>> Option 2 although I wouldn't want to be the one managing and releasing
>> the artifacts :) Having said that, unless it remained maintained and
>> released, I feel like it would just be a component graveyard (or maybe
>> more like the Apache Attic), in which case it seems unnecessary and
>> that's why I'm behind Option 1. Interested to hear others' thoughts of
>> course.
>> 
>> Thanks,
>> Matt
>> 
>> On Fri, Mar 24, 2023 at 2:07 PM Joe Witt  wrote:
>>> 
>>> Team,
>>> 
>>> For the entire time NiFi has been in Apache we've built with support for
>>> various Hadoop ecosystem components like HDFS, Hive, HBase, others,
>>> and more recently formats/serialization modes like those necessary for
>>> Parquet, ORC, Iceberg, etc.
>>> 
>>> All of these things however present endless challenges with
>>> compatibility across different versions (Hive being the most difficult
>>> by fa

Re: [discuss] NiFi support for Hadoop ecosystem components

2023-03-24 Thread Jeremy Dyer
I think option 2 is the best way to handle this.

Technology naturally changes over time and some components of NiFi might not 
make the most sense to keep around in the main line for the masses anymore. 
However, I really like still having them available for people to very simply 
add if they so desire. I see other platforms do this by adding a “contrib” 
repo. What if we had something like a “nifi-contrib” or “nifi-emeritus” repo 
on GitHub (an Apache GitHub repo), where the community can still be involved 
as desired but also keep things readily available to those who might not even 
be heavily involved in the community?

I even see this as a sustainable pattern for any components that need to be 
“moved out”.

I don’t even think the components in the “contrib” repo would need a release 
vote; someone, or a vendor, could update them via PRs after the official 
release.

Jeremy Dyer

Get Outlook for iOS<https://aka.ms/o0ukef>

From: Chakravarty, G 
Sent: Friday, March 24, 2023 4:36:43 PM
To: dev@nifi.apache.org 
Subject: Re: [discuss] NiFi support for Hadoop ecosystem components

I am wondering whether the standard NiFi JDBC/ODBC processors, with some basic 
testing against common Hive drivers such as Simba, could help alleviate the 
issue without requiring separate HiveQL processors.

GC

From: Bryan Bende 
Sent: Friday, March 24, 2023 4:05 PM
To: dev@nifi.apache.org 
Subject: Re: [discuss] NiFi support for Hadoop ecosystem components

I lean towards option 2 with the caveat that maybe we don't have to
retain every Hadoop related component when creating this separate set
of components. Mainly I'm thinking that Hive has been the most
problematic to maintain so maybe that is dropped altogether. I think
it would be unfortunate to not have publicly available HDFS
processors.

On Fri, Mar 24, 2023 at 3:23 PM Matt Burgess  wrote:
>
> As one of the small number of people that fight the battle, I like the
> idea of Option 1 (full disclosure: I work for a vendor). From a
> community standpoint (I'm on the PMC) I'm not strongly opposed to
> Option 2 although I wouldn't want to be the one managing and releasing
> the artifacts :) Having said that, unless it remained maintained and
> released, I feel like it would just be a component graveyard (or maybe
> more like the Apache Attic), in which case it seems unnecessary and
> that's why I'm behind Option 1. Interested to hear others' thoughts of
> course.
>
> Thanks,
> Matt
>
> On Fri, Mar 24, 2023 at 2:07 PM Joe Witt  wrote:
> >
> > Team,
> >
> > For the entire time NiFi has been in Apache we've built with support for
> > various Hadoop ecosystem components like HDFS, Hive, HBase, others,
> > and more recently formats/serialization modes like those necessary for
> > Parquet, ORC, Iceberg, etc.
> >
> > All of these things however present endless challenges with
> > compatibility across different versions (Hive being the most difficult
> > by far), vendors (hadoop vendors, cloud vendors, etc..).  And also
> > super notably the incredible number of dependencies, dependency
> > conflicts, inclusions/exclusions, old log libs, vulnerability updates,
> > etc..  And last but certainly not least a big reason why our build has
> > grown so much.
> >
> > We have a couple options:
> > 1. Deprecate these components in NiFi 1.x and drop them entirely in
> > NiFi 2.x.  Leave this as a problem for vendors to deal with.  NiFi
> > users interacting with such components are nearly exclusively doing so
> > with vendors anyway.
> >
> > 2. Remove the components from NiFi main code line and create a
> > separate repo for 'nifi-hadoop-extensions'.  We manage those
> > independently and release them periodically.  They would be available
> > for people to grab the nars if they want to use them.  We include none
> > of them in the convenience binary going forward by default.
> >
> > 3. Change nothing.  Continue to battle with the above listed items.
> > This is admittedly a bit of a non-option.  We can't keep spending the
> > same time/energy on these we have.  It is a very small number of
> > people that fight this battle.
> >
> > Look forward to hearing thoughts on these options or others we might 
> > consider.
> >
> > Thanks


Re: [discuss] NiFi support for Hadoop ecosystem components

2023-03-24 Thread Joe Witt
James

Some are definitely less fun than others with Hive being the most notable.

I should rephrase my vendor comment on point one: as far as I know, the Hadoop
components themselves are all vendor supported.  Whether NiFi is or not is a
different point.

Option 2 is the most realistic I suspect but still want to see what people
think.

Basically anything which depends on the ‘hadoop-client’ maven artifact is
where the games begin.
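
To make that concrete, one way to find the affected modules is to look for POMs that declare `hadoop-client` directly. The following is a minimal Python sketch and not how the NiFi build does it; the function name `declares_hadoop_client` is hypothetical, and it only sees dependencies written in the given POM text, not transitive ones.

```python
import xml.etree.ElementTree as ET

# Standard namespace used by Maven POM 4.0.0 documents.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def declares_hadoop_client(pom_xml: str) -> bool:
    """Return True if the POM text declares a dependency on hadoop-client.

    Matches any <dependency> inside a <dependencies> element, which also
    includes entries under <dependencyManagement>. Transitive dependencies
    are not resolved.
    """
    root = ET.fromstring(pom_xml)
    for dep in root.findall(".//m:dependencies/m:dependency", POM_NS):
        artifact = dep.find("m:artifactId", POM_NS)
        if artifact is not None and (artifact.text or "").strip() == "hadoop-client":
            return True
    return False
```

Running this over each module's `pom.xml` would give a first list of candidates; a full picture would need Maven's own dependency resolution.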

Thanks

On Fri, Mar 24, 2023 at 2:34 PM James Srinivasan 
wrote:

> I'm a Hadoop and Nifi user without vendor support so unsurprisingly aren't
> keen on #1, but then relying on community support and development is always
> going to be a risk for us. If it came to it, we'd probably stop using Nifi
> rather than pay a vendor which would be a real shame.
>
> Are certain Hadoop processors more maintenance heavy than others? It's a
> rather wide ecosystem.
>
> On Fri, 24 Mar 2023, 18:07 Joe Witt,  wrote:
>
> > Team,
> >
> > For the entire time NiFi has been in Apache we've built with support for
> > various Hadoop ecosystem components like HDFS, Hive, HBase, others,
> > and more recently formats/serialization modes like those necessary for
> > Parquet, ORC, Iceberg, etc.
> >
> > All of these things however present endless challenges with
> > compatibility across different versions (Hive being the most difficult
> > by far), vendors (hadoop vendors, cloud vendors, etc..).  And also
> > super notably the incredible number of dependencies, dependency
> > conflicts, inclusions/exclusions, old log libs, vulnerability updates,
> > etc..  And last but certainly not least a big reason why our build has
> > grown so much.
> >
> > We have a couple options:
> > 1. Deprecate these components in NiFi 1.x and drop them entirely in
> > NiFi 2.x.  Leave this as a problem for vendors to deal with.  NiFi
> > users interacting with such components are nearly exclusively doing so
> > with vendors anyway.
> >
> > 2. Remove the components from NiFi main code line and create a
> > separate repo for 'nifi-hadoop-extensions'.  We manage those
> > independently and release them periodically.  They would be available
> > for people to grab the nars if they want to use them.  We include none
> > of them in the convenience binary going forward by default.
> >
> > 3. Change nothing.  Continue to battle with the above listed items.
> > This is admittedly a bit of a non-option.  We can't keep spending the
> > same time/energy on these we have.  It is a very small number of
> > people that fight this battle.
> >
> > Look forward to hearing thoughts on these options or others we might
> > consider.
> >
> > Thanks
> >
>


Re: [discuss] NiFi support for Hadoop ecosystem components

2023-03-24 Thread James Srinivasan
I'm a Hadoop and NiFi user without vendor support, so unsurprisingly I'm not
keen on #1, but then relying on community support and development is always
going to be a risk for us. If it came to it, we'd probably stop using NiFi
rather than pay a vendor, which would be a real shame.

Are certain Hadoop processors more maintenance-heavy than others? It's a
rather wide ecosystem.

On Fri, 24 Mar 2023, 18:07 Joe Witt,  wrote:

> Team,
>
> For the full time NiFi has been in Apache we've built with support for
> various Hadoop ecosystem components like HDFS, Hive, HBase, others,
> and more recently formats/serialization modes like necessary for
> Parquet, Orc, Iceberg, etc..
>
> All of these things however present endless challenges with
> compatibility across different versions (Hive being the most difficult
> by far), vendors (hadoop vendors, cloud vendors, etc..).  And also
> super notably the incredible number of dependencies, dependency
> conflicts, inclusions/exclusions, old log libs, vulnerability updates,
> etc..  And last but certainly not least a big reason why our build has
> grown so much.
>
> We have a couple options:
> 1. Deprecate these components in NiFi 1.x and drop them entirely in
> NiFi 2.x.  Leave this as a problem for vendors to deal with.  NiFi
> users interacting with such components are nearly exclusively doing so
> with vendors anyway.
>
> 2. Remove the components from NiFi main code line and create a
> separate repo for 'nifi-hadoop-extensions'.  We manage those
> independently and release them periodically.  They would be available
> for people to grab the nars if they want to use them.  We include none
> of them in the convenience binary going forward by default.
>
> 3. Change nothing.  Continue to battle with the above listed items.
> This is admittedly a bit of a non-option.  We can't keep spending the
> same time/energy on these we have.  It is a very small number of
> people that fight this battle.
>
> Look forward to hearing thoughts on these options or others we might
> consider.
>
> Thanks
>


Re: [discuss] NiFi support for Hadoop ecosystem components

2023-03-24 Thread Chakravarty, G
I am wondering whether the standard NiFi JDBC/ODBC processors, with some basic
testing against common Hive drivers such as Simba, could help alleviate the
issue without having separate HiveQL processors.

GC

From: Bryan Bende 
Sent: Friday, March 24, 2023 4:05 PM
To: dev@nifi.apache.org 
Subject: Re: [discuss] NiFi support for Hadoop ecosystem components

I lean towards option 2 with the caveat that maybe we don't have to
retain every Hadoop related component when creating this separate set
of components. Mainly I'm thinking that Hive has been the most
problematic to maintain so maybe that is dropped altogether. I think
it would be unfortunate to not have publicly available HDFS
processors.

On Fri, Mar 24, 2023 at 3:23 PM Matt Burgess  wrote:
>
> As one of the small number of people that fight the battle, I like the
> idea of Option 1 (full disclosure: I work for a vendor). From a
> community standpoint (I'm on the PMC) I'm not strongly opposed to
> Option 2 although I wouldn't want to be the one managing and releasing
> the artifacts :) Having said that, unless it remained maintained and
> released, I feel like it would just be a component graveyard (or maybe
> more like the Apache Attic), in which case it seems unnecessary and
> that's why I'm behind Option 1. Interested to hear others' thoughts of
> course.
>
> Thanks,
> Matt
>
> On Fri, Mar 24, 2023 at 2:07 PM Joe Witt  wrote:
> >
> > Team,
> >
> > For the full time NiFi has been in Apache we've built with support for
> > various Hadoop ecosystem components like HDFS, Hive, HBase, others,
> > and more recently formats/serialization modes like necessary for
> > Parquet, Orc, Iceberg, etc..
> >
> > All of these things however present endless challenges with
> > compatibility across different versions (Hive being the most difficult
> > by far), vendors (hadoop vendors, cloud vendors, etc..).  And also
> > super notably the incredible number of dependencies, dependency
> > conflicts, inclusions/exclusions, old log libs, vulnerability updates,
> > etc..  And last but certainly not least a big reason why our build has
> > grown so much.
> >
> > We have a couple options:
> > 1. Deprecate these components in NiFi 1.x and drop them entirely in
> > NiFi 2.x.  Leave this as a problem for vendors to deal with.  NiFi
> > users interacting with such components are nearly exclusively doing so
> > with vendors anyway.
> >
> > 2. Remove the components from NiFi main code line and create a
> > separate repo for 'nifi-hadoop-extensions'.  We manage those
> > independently and release them periodically.  They would be available
> > for people to grab the nars if they want to use them.  We include none
> > of them in the convenience binary going forward by default.
> >
> > 3. Change nothing.  Continue to battle with the above listed items.
> > This is admittedly a bit of a non-option.  We can't keep spending the
> > same time/energy on these we have.  It is a very small number of
> > people that fight this battle.
> >
> > Look forward to hearing thoughts on these options or others we might 
> > consider.
> >
> > Thanks


Re: [discuss] NiFi support for Hadoop ecosystem components

2023-03-24 Thread Bryan Bende
I lean towards option 2 with the caveat that maybe we don't have to
retain every Hadoop related component when creating this separate set
of components. Mainly I'm thinking that Hive has been the most
problematic to maintain so maybe that is dropped altogether. I think
it would be unfortunate to not have publicly available HDFS
processors.

On Fri, Mar 24, 2023 at 3:23 PM Matt Burgess  wrote:
>
> As one of the small number of people that fight the battle, I like the
> idea of Option 1 (full disclosure: I work for a vendor). From a
> community standpoint (I'm on the PMC) I'm not strongly opposed to
> Option 2 although I wouldn't want to be the one managing and releasing
> the artifacts :) Having said that, unless it remained maintained and
> released, I feel like it would just be a component graveyard (or maybe
> more like the Apache Attic), in which case it seems unnecessary and
> that's why I'm behind Option 1. Interested to hear others' thoughts of
> course.
>
> Thanks,
> Matt
>
> On Fri, Mar 24, 2023 at 2:07 PM Joe Witt  wrote:
> >
> > Team,
> >
> > For the full time NiFi has been in Apache we've built with support for
> > various Hadoop ecosystem components like HDFS, Hive, HBase, others,
> > and more recently formats/serialization modes like necessary for
> > Parquet, Orc, Iceberg, etc..
> >
> > All of these things however present endless challenges with
> > compatibility across different versions (Hive being the most difficult
> > by far), vendors (hadoop vendors, cloud vendors, etc..).  And also
> > super notably the incredible number of dependencies, dependency
> > conflicts, inclusions/exclusions, old log libs, vulnerability updates,
> > etc..  And last but certainly not least a big reason why our build has
> > grown so much.
> >
> > We have a couple options:
> > 1. Deprecate these components in NiFi 1.x and drop them entirely in
> > NiFi 2.x.  Leave this as a problem for vendors to deal with.  NiFi
> > users interacting with such components are nearly exclusively doing so
> > with vendors anyway.
> >
> > 2. Remove the components from NiFi main code line and create a
> > separate repo for 'nifi-hadoop-extensions'.  We manage those
> > independently and release them periodically.  They would be available
> > for people to grab the nars if they want to use them.  We include none
> > of them in the convenience binary going forward by default.
> >
> > 3. Change nothing.  Continue to battle with the above listed items.
> > This is admittedly a bit of a non-option.  We can't keep spending the
> > same time/energy on these we have.  It is a very small number of
> > people that fight this battle.
> >
> > Look forward to hearing thoughts on these options or others we might 
> > consider.
> >
> > Thanks


Re: [discuss] NiFi support for Hadoop ecosystem components

2023-03-24 Thread Matt Burgess
As one of the small number of people that fight the battle, I like the
idea of Option 1 (full disclosure: I work for a vendor). From a
community standpoint (I'm on the PMC) I'm not strongly opposed to
Option 2 although I wouldn't want to be the one managing and releasing
the artifacts :) Having said that, unless it remained maintained and
released, I feel like it would just be a component graveyard (or maybe
more like the Apache Attic), in which case it seems unnecessary and
that's why I'm behind Option 1. Interested to hear others' thoughts of
course.

Thanks,
Matt

On Fri, Mar 24, 2023 at 2:07 PM Joe Witt  wrote:
>
> Team,
>
> For the full time NiFi has been in Apache we've built with support for
> various Hadoop ecosystem components like HDFS, Hive, HBase, others,
> and more recently formats/serialization modes like necessary for
> Parquet, Orc, Iceberg, etc..
>
> All of these things however present endless challenges with
> compatibility across different versions (Hive being the most difficult
> by far), vendors (hadoop vendors, cloud vendors, etc..).  And also
> super notably the incredible number of dependencies, dependency
> conflicts, inclusions/exclusions, old log libs, vulnerability updates,
> etc..  And last but certainly not least a big reason why our build has
> grown so much.
>
> We have a couple options:
> 1. Deprecate these components in NiFi 1.x and drop them entirely in
> NiFi 2.x.  Leave this as a problem for vendors to deal with.  NiFi
> users interacting with such components are nearly exclusively doing so
> with vendors anyway.
>
> 2. Remove the components from NiFi main code line and create a
> separate repo for 'nifi-hadoop-extensions'.  We manage those
> independently and release them periodically.  They would be available
> for people to grab the nars if they want to use them.  We include none
> of them in the convenience binary going forward by default.
>
> 3. Change nothing.  Continue to battle with the above listed items.
> This is admittedly a bit of a non-option.  We can't keep spending the
> same time/energy on these we have.  It is a very small number of
> people that fight this battle.
>
> Look forward to hearing thoughts on these options or others we might consider.
>
> Thanks


[discuss] NiFi support for Hadoop ecosystem components

2023-03-24 Thread Joe Witt
Team,

For the full time NiFi has been in Apache we've built with support for
various Hadoop ecosystem components like HDFS, Hive, HBase, and others,
and more recently formats/serialization modes like those necessary for
Parquet, ORC, Iceberg, etc.

All of these things, however, present endless challenges with
compatibility across different versions (Hive being the most difficult
by far) and vendors (Hadoop vendors, cloud vendors, etc.).  And also,
super notably, the incredible number of dependencies, dependency
conflicts, inclusions/exclusions, old log libs, vulnerability updates,
etc.  And last but certainly not least, they are a big reason why our
build has grown so much.

We have a couple options:
1. Deprecate these components in NiFi 1.x and drop them entirely in
NiFi 2.x.  Leave this as a problem for vendors to deal with.  NiFi
users interacting with such components are nearly exclusively doing so
with vendors anyway.

2. Remove the components from NiFi main code line and create a
separate repo for 'nifi-hadoop-extensions'.  We manage those
independently and release them periodically.  They would be available
for people to grab the nars if they want to use them.  We include none
of them in the convenience binary going forward by default.

3. Change nothing.  Continue to battle with the above listed items.
This is admittedly a bit of a non-option.  We can't keep spending the
same time/energy on these we have.  It is a very small number of
people that fight this battle.

Look forward to hearing thoughts on these options or others we might consider.

Thanks