Dear Joe

Regarding your point 5: this is almost what I'm already doing. Last night, on my phone, I "just wrote" that we created a hash file. What I'm actually doing is converting the flowfile's attributes to JSON.
Is there a way for NiFi to export the complete flowfile (attributes and content) into one file, which we can import again on the other side? Right now I do it in two steps.

Below is a short description of my flow for transferring data between systems where we can't use S2S.

At low side:
get data ->
CryptographicHashContent ->
UpdateAttribute: original.filename = ${filename}, rootHash = ${content_SHA-256} ->
UpdateAttribute: filename = ${UUID()} ->
PutSFTP ->
AttributesToJSON: Destination = flowfile-content ->
UpdateAttribute: filename = ${filename:append('.flowfile')} ->
PutSFTP
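
For illustration only, the two-step export above and the matching check on the receiving end can be sketched outside NiFi. This is a minimal Python sketch, not NiFi code: the function names `export_pair` and `verify_pair` are hypothetical, and local directories stand in for the SFTP servers. Only the attribute names (`original.filename`, `rootHash`, `filename`) and the `.flowfile` suffix come from the flow description.

```python
import hashlib
import json
import uuid
from pathlib import Path

SIDECAR_SUFFIX = ".flowfile"


def export_pair(content: bytes, original_filename: str, out_dir: Path) -> Path:
    """Write the content under a UUID name plus a JSON sidecar '<uuid>.flowfile'.

    Hypothetical helper mirroring the low-side steps; returns the sidecar path.
    """
    data_name = str(uuid.uuid4())
    attributes = {
        "original.filename": original_filename,
        "rootHash": hashlib.sha256(content).hexdigest(),
        "filename": data_name,
    }
    # Step 1: the data file itself.
    (out_dir / data_name).write_bytes(content)
    # Step 2: the attributes as JSON, named after the data file.
    sidecar = out_dir / (data_name + SIDECAR_SUFFIX)
    sidecar.write_text(json.dumps(attributes), encoding="utf-8")
    return sidecar


def verify_pair(sidecar: Path, chunk_size: int = 1 << 20) -> bool:
    """Re-hash the data file that belongs to the sidecar and compare with rootHash."""
    attributes = json.loads(sidecar.read_text(encoding="utf-8"))
    # Same idea as ${filename:substringBefore('.flowfile')} on the high side.
    data_path = sidecar.with_name(sidecar.name[: -len(SIDECAR_SUFFIX)])
    digest = hashlib.sha256()
    with data_path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == attributes["rootHash"]
```

The sketch keeps the same weakness as the flow: the data file and the sidecar are two separate files that can arrive or fail independently of each other.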

At high side:
ListSFTP: File Filter Regex = .*\.flowfile ->
FetchSFTP ->
ExecuteScript: (converting JSON data into attributes) ->
UpdateAttribute: filename = ${filename:substringBefore('.flowfile')} ->
FetchSFTP ->
CryptographicHashContent ->
RouteOnAttribute: Hash_OK = ${rootHash:equals(${content_SHA-256})}
    Hash_OK -> following production flow
    Unmatched -> Error flow

Kind regards
Jens

On Tue, 12 Oct 2021 at 21:36, Joe Witt <joe.w...@gmail.com> wrote:

> Jens
>
> For such a setup the very specific details matter and here there are a
> lot of details. It isn't easy to sort through this for me so I'll
> keep it high level based on my experience in very similar
> situations/setups:
>
> 1. I'd generally trust SFTP to be awesome and damn near failure proof
> in itself. I'd focus on other things.
> 2. I'd generally trust that data packet corruption in terms of network
> transfer is bulletproof and not think that is a problem especially
> since SFTP and various protocols employed here offer certain
> guarantees themselves (including nifi).
> 3. I'd be suspect of one way transfer/guard devices creating issues.
> I'd remove that and try to reproduce the problem.
> 4. In linux a cp/mv is not atomic as I understand if data is spanning
> across file systems so you could have partially written data scenarios
> here potentially.
> 5. I'd be careful to avoid multiple file scenarios such as original
> content and the sha256. Instead if the low side is a NiFi and the
> high side is a NiFi I'd have lowside nifi write out flowfiles and pass
> those over the guard device. Why? Because this gives you your
> original content AND the flowfile attributes (where I'd have the
> sha256). On the high side nifi i'd unpack that flow file and ensure
> the content matches the stated sha256.
>
> Joe
>
> On Tue, Oct 12, 2021 at 12:25 PM Jens M. Kofoed <jmkofoed....@gmail.com>
> wrote:
> >
> > Hi Joe
> >
> > I know what you are thinking but that’s not the case.
> > Check my very short description of my test flow.
> > In my loop the PutSFTP processor is using default settings, which means
> > it uploads files as .filename and renames them when done. The next
> > processor is FetchSFTP, which will load the file as filename. If PutSFTP
> > has not finished uploading the file, it will have the wrong filename, the
> > flowfile will not go from PutSFTP -> FetchSFTP, and therefore FetchSFTP
> > can’t fetch the file. So in my test flow that is not the case.
> >
> > In our production flow, after nifi gets its data it calculates the
> > sha256, uploads the data to an sftp server as .filename and renames it
> > when done (default settings for PutSFTP). Next it creates a new file with
> > the value of the hash and saves it as filename.sha256.
> > At that sftp server a bash script looks for NOT hidden files every
> > 2 seconds with an ls command. If there are files, the bash script does a
> > cp filename /archive/filename and sends the data to server 3 via a data
> > diode. At the other side another nifi server reads the filename.sha256,
> > reads in the hash value and reads in the original data. It calculates a
> > new sha256 and compares the two hashes.
> > Yesterday there was a corruption again, and we checked the file at the
> > first sftp server, where the first nifi saved it after creating the first
> > hash. Running sha256sum on /archive/filename produced a different
> > hash than nifi. So after the PutSFTP and a Linux cp command the file was
> > corrupted.
> > It has been less than 1 file per 1.000.000 files where we have seen
> > these issues. But we see them.
> > Now we are trying to investigate what causes the issue. Therefore I
> > created the small test flow, and already after nearly 9000 iterations in
> > the loop the file has been corrupted just by being uploaded and downloaded
> > again.
> >
> > Are we facing a network issue where a data packet is corrupted?
> > Are there very rare cases where the sftp implementation is doing
> > something wrong?
> > We don’t know yet, but we are running some more tests, also on different
> > systems, to narrow it down.
> >
> > Kind regards
> > Jens M. Kofoed
> >
> > > On 12 Oct 2021 at 19:39, Joe Witt <joe.w...@gmail.com> wrote:
> > >
> > > Hello
> > >
> > > How does nifi grab the data from the file system? It sounds like it is
> > > doing partial reads due to a competing consumer (data still being
> > > written) scenario.
> > >
> > > Thanks
> > >
> > > On Mon, Oct 11, 2021 at 10:36 PM Jens M. Kofoed <jmkofoed....@gmail.com>
> > > wrote:
> > >
> > >> Dear Developers
> > >>
> > >> We have a situation where we see corrupted files after using PutSFTP and
> > >> FetchSFTP in NIFI 1.13.2 with openjdk version "1.8.0_292", OpenJDK Runtime
> > >> Environment (build 1.8.0_292-8u292-b10-0ubuntu1~20.04-b10), OpenJDK 64-Bit
> > >> Server VM (build 25.292-b10, mixed mode) running on Ubuntu Server 20.04.
> > >>
> > >> We have a flow between 2 separated systems where we use PutSFTP to export
> > >> data from one NIFI instance to a data diode and use FetchSFTP to grab data
> > >> on the other end. To be sure data is not corrupted we calculate a SHA256 on
> > >> each side, and transfer the flowfile metadata in a separate file. In rare
> > >> cases we have seen that the SHA256 doesn't match on both sides, and we are
> > >> investigating where the errors happen. We see 2 errors. Manually
> > >> calculating a SHA256 on both sides of the diode, the file is OK, and we
> > >> have found that the errors happen between NIFI and the SFTP servers. And
> > >> it can happen at both sides.
> > >> So for testing I created this little flow:
> > >> GenerateFlowFile (size 100MB) (Run once) ->
> > >> CryptographicHashContent (SHA256) ->
> > >> UpdateAttribute ( hash.root = ${content_SHA-256} , iteration=1) ->
> > >> PutSFTP ->
> > >> FetchSFTP ->
> > >> CryptographicHashContent (SHA256) ->
> > >> RouteOnAttribute (compare hash.root vs. content_SHA-256)
> > >>     If unmatch ->
> > >>         Going to a disabled processor, which holds the corrupted file
> > >>         in a file queue
> > >>     If match ->
> > >>         UpdateAttribute ( iteration = ${iteration:plus(1)} ) -> looping
> > >>         back to PutSFTP
> > >>
> > >> After 8992 iterations the file is corrupted. To test if the errors are in
> > >> the calculation of the SHA256 I have a copy of the flow without the
> > >> PUT/FETCH SFTP processors, which hasn't had any errors yet.
> > >>
> > >> It is very rare that we see these errors; millions of files are going
> > >> through without any issues, but sometimes it happens, which is not good.
> > >>
> > >> Can anyone please help? Maybe try to set up the same test and see if you
> > >> also have a corrupted file after some days.
> > >>
> > >> Kind regards
> > >> Jens M. Kofoed
> > >>
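
The loop test from the original message (hash once, transfer, re-hash, compare, repeat) can also be approximated outside NiFi to help narrow down where corruption is introduced. What follows is only a minimal Python sketch under stated assumptions: `shutil.copyfile` on a local directory stands in for the PutSFTP/FetchSFTP pair, and the names `sha256_of` and `round_trip_test` are hypothetical. It does not exercise SFTP or the network at all.

```python
import hashlib
import os
import shutil
from pathlib import Path
from typing import Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def round_trip_test(work_dir: Path, size_bytes: int, iterations: int) -> Optional[int]:
    """Copy a random payload back and forth, re-hashing after every copy.

    Returns the 1-based iteration at which the hash first differs from the
    root hash, or None if all iterations matched.
    """
    source = work_dir / "payload.bin"
    source.write_bytes(os.urandom(size_bytes))
    root_hash = sha256_of(source)  # plays the role of hash.root

    current = source
    for iteration in range(1, iterations + 1):
        # Alternate between two target names so source and target never coincide.
        target = work_dir / f"copy_{iteration % 2}.bin"
        shutil.copyfile(current, target)  # assumption: stand-in for put + fetch
        if sha256_of(target) != root_hash:
            return iteration
        current = target
    return None
```

Run against the same disks and file systems the SFTP servers use, a loop like this could indicate whether a mismatch also appears without NiFi and SFTP in the path; a clean run would not rule them out.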