Re: [discuss] NiFi support for Hadoop ecosystem components

Joe Witt Wed, 29 Mar 2023 13:20:32 -0700

Given the discussion, my sense is that a clean option 1 is not really
realistic.  We're more likely looking at option 2, with a far less
frequent release cycle for those components and thus lower overhead.  My
biggest gripe is the constant workarounds we need in all these poms to
wrangle dependencies and includes/excludes, simply because those projects
do not maintain them with nearly the speed/focus that we do.


Thanks

On Wed, Mar 29, 2023 at 1:07 PM Steven Matison <[email protected]> wrote:
>
> I waited to respond because I wanted to see the conversation before jumping
> in.
>
> As a user, fan, and consultant I am super excited to even know the context
> around NiFi 2.0 and the entire modernization effort that is already ongoing
> through the end of 1.x. I am also thankful to be working on some of it myself.
> This is an amazing time for nifi.
>
> With respect to anyone using "old nifi" this topic is super concerning, but
> to me it is no more concerning than why you are still using older versions
> and/or why you need deep backward compatibility with new versions that are
> several years ahead of where you are.  Are you not also modernizing?
>
> Those things aside, my choice is #1.  I speak for those using everything
> from the current version all the way back to some of the oldest versions
> you could imagine are still online.  These people will continue to operate
> nifi the way they need it in those versions.  If and when they decide to
> take a new version, they will, and they will also deal with the inherent
> challenges of modernizing.
> Many of them work completely outside of community and will do that
> themselves or rely on a vendor who can support that path for them.
>
>
>
>
>
> On Wed, Mar 29, 2023 at 11:03 AM Chakravarty, G <[email protected]> wrote:
>
> > One of our primary reasons for using NiFi is that it connects nicely
> > with on-prem HDFS/Hive/Kudu data stores. Also, although the on-prem
> > Hadoop/Hive tech stack is somewhat less popular now, the same HDFS/Hive
> > technology is appearing in the cloud under different names: Google
> > Dataproc, AWS EMR, Azure HDInsight, Iceberg, etc. A set of generic
> > components in which the Hadoop processors' connectivity to NiFi is
> > maintained, while individual vendors maintain their own connectivity to
> > their products, would be a good option if possible.
> >
> > GC
> >
> > ________________________________
> > From: Isha Lamboo <[email protected]>
> > Sent: Monday, March 27, 2023 9:04 AM
> > To: [email protected] <[email protected]>
> > Subject: RE: [discuss] NiFi support for Hadoop ecosystem components
> >
> > From the perspective of a NiFi administrator:
> >
> > Removing the xxxHDFS processors anytime soon (2.0) would be a huge issue
> > for us. It shouldn't be, the last Hadoop cluster in our environments was
> > shut down earlier this year. Hive was already gone more than a year ago.
> > But we still have 1000+ HDFS processors in use to manage the Azure
> > Datalake. Azure-specific processors have been available for a while, but
> > there was no business case for migrating solutions that were working fine.
> >
> > Getting the required development time/budget to migrate all those flows to
> > the Azure processors doesn't look very realistic. This would have to be a
> > gradual "replace when you need to change and test the flow anyway" affair.
> > Until that finishes, we'd be stuck on the 1.x branch since we're not using
> > vendor support.
> >
> > Option #2 would be vastly preferable to #1 for this simple and dumb reason.
> >
> > Disregarding our technical debt issues, I agree that it makes sense for
> > NiFi instances with a lot of Hadoop integration to depend on vendors for
> > their specific flavor of Hadoop, while core NiFi moves forward without all
> > of that complexity.
> >
> > Regards,
> >
> > Isha
> >
> > -----Original message-----
> > From: Nandor Soma Abonyi <[email protected]>
> > Sent: Monday, March 27, 2023 12:31
> > To: [email protected]
> > Subject: Re: [discuss] NiFi support for Hadoop ecosystem components
> >
> > Thank you for raising this topic, Joe!
> >
> > While I understand the desire to remove Hadoop components, I have mixed
> > feelings about removing one of the core parts of the Big Data world from
> > the project. I'm unsure how many users we would give a hard time by
> > removing those components. It seems to be too significant a shift in our
> > philosophy.
> > We can already see in the above example that somebody would not use NiFi
> > if we removed them.
> >
> > Furthermore, although Hadoop has been buried multiple times, new
> > technologies seem to still depend on it. For example, Iceberg, in which
> > case I'm worried about the consequences of removing the support for an
> > increasingly popular technology.
> >
> > So I wonder whether it is possible to find a forward-looking solution that
> > could serve all projects. I've always found configuring Hadoop and friends
> > too tricky and I thought it was primarily for historical reasons. The
> > issues you describe could easily result from such a thing. I assume that
> > over time, more and more things have been added on top of the existing
> > implementation without significant refactoring.
> >
> > My - probably utopian - idea would be to contact the Hadoop and Hive
> > teams and share the issues we are dealing with. Probably we are not alone
> > in these problems, but I don't know whether they are aware of them. Even if
> > they are, I think approaching them is worth the chance. Who knows where we
> > will end up if somebody representing the NiFi project does that?
> >
> > Regards,
> > Nandor Soma Abonyi
> >
> >
> > > On Mar 24, 2023, at 10:40 PM, Jeremy Dyer <[email protected]> wrote:
> > >
> > > I think option 2 is the best way to handle this.
> > >
> > > Technology naturally changes over time and some components of NiFi
> > > might not make the most sense to keep around in the main line for the
> > > masses anymore. However, I really like still having them there for
> > > people to very simply add if they so desire. I see other platforms do
> > > this by adding a “contrib” repo. What if we had something like a
> > > “nifi-contrib” or “nifi-emeritus” repo in GitHub, Apache GitHub repo,
> > > where the community can still be involved as desired but also keep
> > > things readily available to those who might not even be heavily
> > > involved in the community?
> > >
> > > I even see this as a sustainable pattern for any components that need
> > > to be “moved out”.
> > >
> > > I wouldn’t even think those components in the “contrib” repo would
> > > require voting on for releases but someone, or a vendor, could update
> > > them via PRs after the official release.
> > >
> > > Jeremy Dyer
> > >
> > > ________________________________
> > > From: Chakravarty, G <[email protected]>
> > > Sent: Friday, March 24, 2023 4:36:43 PM
> > > To: [email protected] <[email protected]>
> > > Subject: Re: [discuss] NiFi support for Hadoop ecosystem components
> > >
> > > I am wondering whether the standard NiFi JDBC/ODBC processors, with
> > > some basic testing against common Hive drivers such as Simba, could
> > > help alleviate the issue without having separate HiveQL processors.
> > >
> > > GC
> > > ________________________________
> > > From: Bryan Bende <[email protected]>
> > > Sent: Friday, March 24, 2023 4:05 PM
> > > To: [email protected] <[email protected]>
> > > Subject: Re: [discuss] NiFi support for Hadoop ecosystem components
> > >
> > > I lean towards option 2 with the caveat that maybe we don't have to
> > > retain every Hadoop related component when creating this separate set
> > > of components. Mainly I'm thinking that Hive has been the most
> > > problematic to maintain so maybe that is dropped altogether. I think
> > > it would be unfortunate to not have publicly available HDFS
> > > processors.
> > >
> > > On Fri, Mar 24, 2023 at 3:23 PM Matt Burgess <[email protected]> wrote:
> > >>
> > >> As one of the small number of people that fight the battle, I like
> > >> the idea of Option 1 (full disclosure: I work for a vendor). From a
> > >> community standpoint (I'm on the PMC) I'm not strongly opposed to
> > >> Option 2 although I wouldn't want to be the one managing and
> > >> releasing the artifacts :) Having said that, unless it remained
> > >> maintained and released, I feel like it would just be a component
> > >> graveyard (or maybe more like the Apache Attic), in which case it
> > >> seems unnecessary and that's why I'm behind Option 1. Interested to
> > >> hear others' thoughts of course.
> > >>
> > >> Thanks,
> > >> Matt
> > >>
> > >> On Fri, Mar 24, 2023 at 2:07 PM Joe Witt <[email protected]> wrote:
> > >>>
> > >>> Team,
> > >>>
> > >>> For the full time NiFi has been in Apache we've built in support
> > >>> for various Hadoop ecosystem components like HDFS, Hive, HBase, and
> > >>> others, and more recently the formats/serialization modes necessary
> > >>> for Parquet, ORC, Iceberg, etc.
> > >>>
> > >>> All of these things, however, present endless challenges with
> > >>> compatibility across different versions (Hive being the most
> > >>> difficult by far) and vendors (Hadoop vendors, cloud vendors, etc.),
> > >>> and also, super notably, the incredible number of dependencies,
> > >>> dependency conflicts, inclusions/exclusions, old log libs,
> > >>> vulnerability updates, etc.  And last but certainly not least, they
> > >>> are a big reason why our build has grown so much.
> > >>>
> > >>> We have a couple options:
> > >>> 1. Deprecate these components in NiFi 1.x and drop them entirely in
> > >>> NiFi 2.x.  Leave this as a problem for vendors to deal with.  NiFi
> > >>> users interacting with such components are nearly exclusively doing
> > >>> so with vendors anyway.
> > >>>
> > >>> 2. Remove the components from NiFi main code line and create a
> > >>> separate repo for 'nifi-hadoop-extensions'.  We manage those
> > >>> independently and release them periodically.  They would be
> > >>> available for people to grab the nars if they want to use them.  We
> > >>> include none of them in the convenience binary going forward by
> > >>> default.
> > >>>
> > >>> 3. Change nothing.  Continue to battle with the above listed items.
> > >>> This is admittedly a bit of a non-option.  We can't keep spending
> > >>> the same time/energy on these that we have been.  It is a very small
> > >>> number of people that fight this battle.
> > >>>
> > >>> Look forward to hearing thoughts on these options or others we might
> > >>> consider.
> > >>>
> > >>> Thanks
> >
> >
