Dear Joe

Regarding your point 5: this is almost what I'm doing too. But last night,
writing from my phone, I just wrote that we create a hash file. What I'm
actually doing is converting the flowfile attributes to JSON.
Is there a way NiFi can export the complete flowfile (attributes and
content) into one file, which we can import again on the other side? Right
now I do it in 2 steps.
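
If I understand the documentation correctly, this may already be possible
in one step: MergeContent has a Merge Format of "FlowFile Stream, v3",
which packages content and attributes together into a single file, and
UnpackContent on the other side should be able to unpack it again with
Packaging Format "flowfile-stream-v3". I haven't tested this myself, so
the sketch below is only my reading of the docs:

At low side:  ... -> MergeContent (Merge Format = FlowFile Stream, v3) -> PutSFTP
At high side: FetchSFTP -> UnpackContent (Packaging Format = flowfile-stream-v3) -> ...
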
Below is a short description of my flow for transferring data between
systems where we can't use S2S.
At low side:
get data ->
  CryptographicHashContent ->
    UpdateAttribute: original.filename = ${filename}, rootHash = ${content_SHA-256} ->
      UpdateAttribute: filename = ${UUID()} ->
        PutSFTP ->
          AttributesToJSON: Destination = flowfile-content ->
            UpdateAttribute: filename = ${filename:append('.flowfile')} ->
              PutSFTP
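
For illustration, the .flowfile sidecar that AttributesToJSON writes is
just a flat JSON object of the attributes, something like this (the values
here are made up, and the exact set of fields depends on the Attributes
List setting):

{
  "original.filename": "data.bin",
  "rootHash": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
  "filename": "1c0f2a1e-5d7a-4d2e-9d6e-8a2b3c4d5e6f"
}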

At high side:
ListSFTP: File Filter Regex = .*\.flowfile ->
  FetchSFTP ->
    ExecuteScript: (converting the JSON data into attributes) ->
      UpdateAttribute: filename = ${filename:substringBefore('.flowfile')} ->
        FetchSFTP ->
          CryptographicHashContent ->
            RouteOnAttribute: Hash_OK = ${rootHash:equals(${content_SHA-256})} ->
              Hash_OK -> following production flow
              Unmatched -> Error flow
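
For reference, the ExecuteScript step does roughly the following (a
minimal Jython sketch, assuming the .flowfile content is a flat JSON
object of strings as above; not my exact production script):

import json
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import InputStreamCallback

class ReadJson(InputStreamCallback):
    def __init__(self):
        self.attrs = {}
    def process(self, inputStream):
        # read the whole .flowfile JSON sidecar into a dict
        self.attrs = json.loads(IOUtils.toString(inputStream, StandardCharsets.UTF_8))

flowFile = session.get()
if flowFile is not None:
    callback = ReadJson()
    session.read(flowFile, callback)
    # copy every JSON field onto the flowfile as an attribute
    for k, v in callback.attrs.items():
        flowFile = session.putAttribute(flowFile, k, str(v))
    session.transfer(flowFile, REL_SUCCESS)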

Kind regards
Jens

On Tue, 12 Oct 2021 at 21:36, Joe Witt <joe.w...@gmail.com> wrote:

> Jens
>
> For such a setup the very specific details matter, and here there are a
> lot of details.  It isn't easy for me to sort through all of this, so
> I'll keep it high level, based on my experience in very similar
> situations/setups:
>
> 1. I'd generally trust SFTP to be awesome and damn near failure-proof
> in itself.  I'd focus on other things.
> 2. I'd generally trust that network transfer is bulletproof against
> data packet corruption and not think that is a problem, especially
> since SFTP and the various protocols employed here offer certain
> guarantees themselves (including NiFi).
> 3. I'd be suspect of one-way transfer/guard devices creating issues.
> I'd remove that and try to reproduce the problem.
> 4. In Linux a cp/mv is not atomic, as I understand it, if data spans
> file systems, so you could potentially have partially written data
> scenarios here.
> 5. I'd be careful to avoid multiple-file scenarios such as original
> content plus the sha256.  Instead, if the low side is a NiFi and the
> high side is a NiFi, I'd have the low-side NiFi write out flowfiles and
> pass those over the guard device.  Why?  Because this gives you your
> original content AND the flowfile attributes (where I'd keep the
> sha256).  On the high-side NiFi I'd unpack that flowfile and ensure
> the content matches the stated sha256.
>
> Joe
>
> On Tue, Oct 12, 2021 at 12:25 PM Jens M. Kofoed <jmkofoed....@gmail.com>
> wrote:
> >
> > Hi Joe
> >
> > I know what you are thinking, but that's not the case.
> > Check my very short description of my test flow.
> > In my loop the PutSFTP processor is using default settings, which means
> > it uploads files as .filename and renames them when done. The next
> > processor is FetchSFTP, which will load the file as filename. If PutSFTP
> > has not finished uploading the file, the file will still have the wrong
> > name, the flowfile will not go from PutSFTP -> FetchSFTP, and therefore
> > FetchSFTP can't fetch the file. So in my test flow that is not the case.
> >
> > In our production flow, after NiFi gets its data it calculates the
> > SHA-256, uploads the data to an SFTP server as .filename, and renames it
> > when done (default settings for PutSFTP). Next it creates a new file
> > with the value of the hash and saves it as filename.sha256.
> > At that SFTP server a bash script looks for non-hidden files every 2
> > seconds with an ls command. If there are files, the bash script does a
> > cp filename /archive/filename and sends the data to server 3 via a data
> > diode. At the other side another NiFi server reads filename.sha256,
> > reads in the hash value, and reads in the original data. It calculates a
> > new SHA-256 and compares the two hashes.
> > Yesterday there was a corruption again, and we checked the file at the
> > first SFTP server, where the first NiFi saved it after creating the
> > first hash. Running sha256sum on /archive/filename produced a different
> > hash than NiFi's. So after the PutSFTP and a Linux cp command, the file
> > was corrupted.
> > It has been fewer than 1 file per 1,000,000 files where we have seen
> > these issues. But we do see them.
> > Now we are trying to investigate what causes the issue. Therefore I
> > created the small test flow, and already after nearly 9000 iterations in
> > the loop, the file has been corrupted just by being uploaded and
> > downloaded again.
> >
> > Are we facing a network issue where a data packet is corrupted?
> > Is there a very rare case where the SFTP implementation is doing
> > something wrong?
> > We don't know yet, but we are running some more tests, and on different
> > systems, to narrow it down.
> >
> > Kind regards
> > Jens M. Kofoed
> >
> > > On 12 Oct 2021, at 19:39, Joe Witt <joe.w...@gmail.com> wrote:
> > >
> > > Hello
> > >
> > > How does NiFi grab the data from the file system?  It sounds like it
> > > is doing partial reads due to a competing-consumer (data still being
> > > written) scenario.
> > >
> > > Thanks
> > >
> > > On Mon, Oct 11, 2021 at 10:36 PM Jens M. Kofoed <jmkofoed....@gmail.com>
> > > wrote:
> > >
> > >> Dear Developers
> > >>
> > >> We have a situation where we see corrupted files after using PutSFTP
> > >> and FetchSFTP in NiFi 1.13.2 with openjdk version "1.8.0_292", OpenJDK
> > >> Runtime Environment (build 1.8.0_292-8u292-b10-0ubuntu1~20.04-b10),
> > >> OpenJDK 64-Bit Server VM (build 25.292-b10, mixed mode), running on
> > >> Ubuntu Server 20.04.
> > >>
> > >> We have a flow between 2 separated systems where we use PutSFTP to
> > >> export data from one NiFi instance to a data diode and use FetchSFTP
> > >> to grab the data on the other end. To be sure data is not corrupted we
> > >> calculate a SHA-256 on each side and transfer the flowfile metadata in
> > >> a separate file. In rare cases we have seen that the SHA-256 doesn't
> > >> match on both sides, and we are investigating where the errors happen.
> > >> We have seen 2 errors. Manually calculating a SHA-256 on both sides of
> > >> the diode, the file is OK, and we have found that the errors happen
> > >> between NiFi and the SFTP servers; it can happen on both sides.
> > >> So for testing I created this little flow:
> > >> GenerateFlowFile (size 100MB) (Run once) ->
> > >> CryptographicHashContent (SHA256) ->
> > >> UpdateAttribute ( hash.root = ${content_SHA-256} , iteration = 1 ) ->
> > >> PutSFTP ->
> > >> FetchSFTP ->
> > >> CryptographicHashContent (SHA256) ->
> > >> RouteOnAttribute (compare hash.root vs. content_SHA-256)
> > >>     If unmatched ->
> > >>         route to a disabled processor, to hold the corrupted file in a
> > >> queue
> > >>     If matched ->
> > >>         UpdateAttribute ( iteration = ${iteration:plus(1)} ) ->
> > >> looping back to PutSFTP
> > >>
> > >> After 8992 iterations the file was corrupted. To test whether the
> > >> errors are in the calculation of the SHA-256, I have a copy of the
> > >> flow without the Put/FetchSFTP processors, which hasn't had any errors
> > >> yet.
> > >>
> > >> It is very rare that we see these errors; millions of files go through
> > >> without any issues, but sometimes it happens, which is not good.
> > >>
> > >> Can anyone please help? Maybe try setting up the same test and see if
> > >> you also have a corrupted file after some days.
> > >>
> > >> Kind regards
> > >> Jens M. Kofoed
> > >>
>
