One of our primary reasons for using NiFi is that it connects well to on-prem HDFS/Hive/Kudu data stores. Also, although the on-prem Hadoop/Hive tech stack is somewhat less popular now, the same HDFS/Hive technology is appearing in the cloud under different names: Google Dataproc, AWS EMR, Azure HDInsight, Iceberg, etc. If possible, a good option would be some kind of generic components that preserve the Hadoop processors' connectivity to NiFi, while individual vendors maintain connectivity to their own products.
GC

________________________________
From: Isha Lamboo <isha.lam...@virtualsciences.nl>
Sent: Monday, March 27, 2023 9:04 AM
To: dev@nifi.apache.org <dev@nifi.apache.org>
Subject: RE: [discuss] NiFi support for Hadoop ecosystem components

From the perspective of a NiFi administrator:

Removing the xxxHDFS processors anytime soon (2.0) would be a huge issue for us. It shouldn't be: the last Hadoop cluster in our environments was shut down earlier this year, and Hive was already gone more than a year ago. But we still have 1000+ HDFS processors in use to manage the Azure Data Lake. Azure-specific processors have been available for a while, but there was no business case for migrating solutions that were working fine.

Getting the required development time/budget to migrate all those flows to the Azure processors doesn't look very realistic. This would have to be a gradual "replace when you need to change and test the flow anyway" affair. Until that finishes, we'd be stuck on the 1.x branch, since we're not using vendor support. Option #2 would be vastly preferable to #1 for this simple and dumb reason.

Disregarding our technical debt issues, I agree that it makes sense for NiFi instances with a lot of Hadoop integration to depend on vendors for their specific flavor of Hadoop, while core NiFi moves forward without all of that complexity.

Regards,
Isha

-----Original Message-----
From: Nandor Soma Abonyi <nsabo...@icloud.com.INVALID>
Sent: Monday, March 27, 2023 12:31
To: dev@nifi.apache.org
Subject: Re: [discuss] NiFi support for Hadoop ecosystem components

Thank you for raising this topic, Joe!

While I understand the desire to remove Hadoop components, I have mixed feelings about removing one of the core parts of the Big Data world from the project. I'm unsure how many users we would cause trouble for by removing those components. It seems too significant a shift in our philosophy.
We can already see in the above example that somebody would not use NiFi if we removed them. Furthermore, although Hadoop has been buried multiple times, new technologies still seem to depend on it. Iceberg is one example, and I'm worried about the consequences of removing support for an increasingly popular technology. So I wonder whether it is possible to find a forward-looking solution that could serve all projects.

I've always found configuring Hadoop and friends too tricky, and I thought it was primarily for historical reasons. The issues you describe could easily result from such a thing. I assume that over time, more and more things have been added on top of the existing implementation without significant refactoring.

My - probably utopian - idea would be to contact the Hadoop and Hive teams and share the issues we are dealing with. We are probably not alone in these problems, but I don't know whether they are aware of them. Even if they are, I think approaching them is worth the chance. Who knows where we will end up if somebody representing the NiFi project does that?

Regards,
Nandor Soma Abonyi

> On Mar 24, 2023, at 10:40 PM, Jeremy Dyer <jdy...@gmail.com> wrote:
>
> I think option 2 is the best way to handle this.
>
> Technology naturally changes over time and some components of NiFi might not make the most sense to keep around in the main line for the masses anymore. However, I really like still having them there for people to very simply add if they so desire. I see other platforms do this by adding a “contrib” repo. What if we had something like a “nifi-contrib” or “nifi-emeritus” repo in GitHub (an Apache GitHub repo), where the community can still be involved as desired but also keep things readily available to those who might not even be heavily involved in the community?
>
> I even see this as a sustainable pattern for any components that need to be “moved out”.
>
> I wouldn’t even think those components in the “contrib” repo would require voting on for releases, but someone, or a vendor, could update them via PRs after the official release.
>
> Jeremy Dyer
>
> ________________________________
> From: Chakravarty, G <g.c...@plenium.com>
> Sent: Friday, March 24, 2023 4:36:43 PM
> To: dev@nifi.apache.org <dev@nifi.apache.org>
> Subject: Re: [discuss] NiFi support for Hadoop ecosystem components
>
> I am wondering if the standard NiFi JDBC/ODBC processors, with some basic testing against common Hive drivers such as Simba, could help alleviate the issue without having separate HiveQL processors.
>
> GC
> ________________________________
> From: Bryan Bende <bbe...@gmail.com>
> Sent: Friday, March 24, 2023 4:05 PM
> To: dev@nifi.apache.org <dev@nifi.apache.org>
> Subject: Re: [discuss] NiFi support for Hadoop ecosystem components
>
> I lean towards option 2 with the caveat that maybe we don't have to retain every Hadoop-related component when creating this separate set of components. Mainly I'm thinking that Hive has been the most problematic to maintain, so maybe that is dropped altogether. I think it would be unfortunate to not have publicly available HDFS processors.
>
> On Fri, Mar 24, 2023 at 3:23 PM Matt Burgess <mattyb...@apache.org> wrote:
>>
>> As one of the small number of people that fight the battle, I like the idea of Option 1 (full disclosure: I work for a vendor).
>> From a community standpoint (I'm on the PMC) I'm not strongly opposed to Option 2, although I wouldn't want to be the one managing and releasing the artifacts :) Having said that, unless it remained maintained and released, I feel like it would just be a component graveyard (or maybe more like the Apache Attic), in which case it seems unnecessary, and that's why I'm behind Option 1. Interested to hear others' thoughts of course.
>>
>> Thanks,
>> Matt
>>
>> On Fri, Mar 24, 2023 at 2:07 PM Joe Witt <joe.w...@gmail.com> wrote:
>>>
>>> Team,
>>>
>>> For the full time NiFi has been in Apache we've built with support for various Hadoop ecosystem components like HDFS, Hive, HBase, and others, and more recently formats/serialization modes such as those necessary for Parquet, ORC, Iceberg, etc.
>>>
>>> All of these things, however, present endless challenges with compatibility across different versions (Hive being the most difficult by far) and vendors (Hadoop vendors, cloud vendors, etc.). Also, super notably, they bring an incredible number of dependencies, dependency conflicts, inclusions/exclusions, old log libs, vulnerability updates, etc. And last but certainly not least, they are a big reason why our build has grown so much.
>>>
>>> We have a couple of options:
>>> 1. Deprecate these components in NiFi 1.x and drop them entirely in NiFi 2.x. Leave this as a problem for vendors to deal with. NiFi users interacting with such components are nearly exclusively doing so with vendors anyway.
>>>
>>> 2. Remove the components from the NiFi main code line and create a separate repo for 'nifi-hadoop-extensions'. We manage those independently and release them periodically. They would be available for people to grab the nars if they want to use them. We include none of them in the convenience binary going forward by default.
>>>
>>> 3. Change nothing. Continue to battle with the above listed items.
>>> This is admittedly a bit of a non-option. We can't keep spending the same time/energy on these that we have been. It is a very small number of people that fight this battle.
>>>
>>> Look forward to hearing thoughts on these options or others we might consider.
>>>
>>> Thanks