Re: some questions about splits

Matt Burgess Wed, 24 Feb 2021 15:55:25 -0800

Geoffrey,

There's a really good blog by the man himself [1] :) I highly recommend the
official blog in general, lots of great posts and many are record-oriented
[2]


Regards,
Matt

[1] https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
[2] https://blogs.apache.org/nifi/

On Wed, Feb 24, 2021 at 5:57 PM Greene (US), Geoffrey N <
geoffrey.n.gre...@boeing.com> wrote:

> Thank you for the fast response Mark.
>
>
>
> Hrm, record processing does sound useful.
>
>
>
> Are there any good blogs / documentation on this?  I’d really like to
> learn more.  I’ve been doing mostly text processing, as you’ve observed.
>
>
>
> My use case is something like this
>
> 1)      Use server API to get list of sensors
>
> 2)      Use server API to get list of jobs
>
> 3)      For each job, get count of frames/job.  There are up to 6 sets of
> frames/job, depending on which set you query for.
>
> 4)      Frames can only be queried 50 at a time, so for each 50, get set
> of time stamps of a frame
>
> 5)      For each time stamp, query for all sensor values at that time.
> These have to be queried one-at-a-time., because of the way the API works –
> one sensor value can only be given for one time.
>
> 6)      Glue all the data associated with that job (frame times, sensor
> readings,etc) together and paste in a big json. (There are more steps after
> that.
>
>
>
>
>
> Geoffrey Greene
>
> Associate Technical Fellow/Senior Software Ninjaneer
>
> (703) 414 2421
>
> The Boeing Company
>
>
>
> *From:* Mark Payne [mailto:marka...@hotmail.com]
> *Sent:* Wednesday, February 24, 2021 5:20 PM
> *To:* users@nifi.apache.org
> *Subject:* [EXTERNAL] Re: some questions about splits
>
>
>
> EXT email: be mindful of links/attachments.
>
>
>
>
> Geoffrey,
>
>
>
> At a high level, if you’re splitting multiple times and then trying to
> re-assemble everything, then yes I think your thought process is correct.
> But you’ve no doubt seen how complex and cumbersome this approach can be.
> It can also result in extremely poor performance. So much so that when I
> began creating a series of YouTube videos on NiFi Anti-Patterns, the first
> anti-pattern that I covered was the splitting and re-merging of data [1].
>
>
>
> Generally, this should be an absolute last resort, and Record-oriented
> processors should be used instead of splitting the data up and re-merging
> it. If you need to perform REST calls, you could do that with LookupRecord,
> and either use the RESTLookupService or if that doesn’t fit the bill
> exactly you could actually use the ScriptedLookupService and write a small
> script in Groovy or Python that would perform the REST call for you and
> return the results. Or perhaps the ScriptedTransformRecord would be more
> appropriate - hard to tell without knowing the exact use case.
>
>
>
> Obviously, your mileage may vary, but switching the data flow to use
> record-oriented processors, if possible, would typically yield a flow that
> is much simpler and yield throughput that is at least an order of magnitude
> better.
>
>
>
> But if for whatever reason you do end up being stuck with the split/merge
> approach - the key would likely be to consider backpressure heavily. If you
> have backpressure set to 10,000 FlowFiles (the default) and then you’re
> trying to merge together data, but the data comes from many different
> upstream splits, you can certainly end up in a situation like this, where
> you don’t have all of the data from a given ’split’ queued up. for
> MergeContent.
>
>
>
> Hope this helps!
>
> -Mark
>
>
>
> [1] https://www.youtube.com/watch?v=RjWstt7nRVY
>
>
>
>
>
> On Feb 24, 2021, at 4:59 PM, Greene (US), Geoffrey N <
> geoffrey.n.gre...@boeing.com> wrote:
>
>
>
> Im having some trouble with multiple splits/merges.  Here’s the idea:
>
>
>
>
>
> Big data -> split 1->Save all the fragment.*attributes into variables ->
> split 2-> save all the fragment.* attributes
>
>     |
>
> Split 1
>
>    |
>
> Save fragment.* attributes into split1.fragment.*
>
> |
>
> Split 2
>
> |
>
> Save fragment.* attributes into split2.fragment.* attributes
>
> |
>
> (More processing)
>
> |
>
> Split 3
>
> |
>
> Save fragment.* attributes into split3.fragment.* attributes
>
> |
>
> (other stuff)
>
> |
>
> Restore split3.fragment.* attributes to fragment.*
>
> |
>
> Merge3, using defragment strategy
>
> |
>
> Restore split2.fragment.* attributes to fragment.*
>
> |
>
> Merge 2 using defragment strategy
>
> |
>
> Restore split1.frragment.* attributes to fragment.*
>
> |
>
> Merge 1 using defragment strategy
>
>
>
> Am I thinking about this correctly?  It seems like sometimes, nifi is
> unable to do a merge on some of the split data (errors like “there are 50
> fragments, but we only found one).  Is it possible that I need to do some
> prioritization in the queues? I have noticed that my things do back up and
> the queues seem to fill up as its going through (several of the splits need
> to perform rest calls and processing, which can take time.  Maybe the issue
> is that one fragment “slips” through, before the others have even been
> processed far enough.  Is there an approved way to do this?
>
>
>
> Thanks for the help!
>
>
>

Re: some questions about splits

Reply via email to