Joe -

On the data side, the data gets written once as it comes in, two more
times during processing, and then one more time by MergeContent.  So the
write I/O is 4x, though only 3 of those writes actually hit a commit to
the file system.

Provenance, on the other hand, is the other side of the coin: not much
gets written per event, just small record-of-fact entries.  I looked at
my worst case and I think it's 13 events for a single flowfile UUID.
That isn't my worst path, but it's the one with the most data going
through it.

The disk partitioning likely helps with QoS by ensuring that flowfile
and provenance writes are not held up by content writes, but sadly I
don't have that kind of QoS available to me.  On the version question:
I'm running 0.6.1.  I just switched from an old release that was years
out of date, and I think the key to things working before was that I had
QoS on the disk I/O, which I've now lost.

Also, as a note, I had moved provenance to the volatile repository and
the problem disappeared.  That's not the configuration I actually want,
though, so I re-enabled the persistent repository today.
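
For reference, the switch I toggled is the provenance repository
implementation in nifi.properties (directory settings aside, I believe
this is the only property involved):

  # persistent (the default) - writes events to disk
  nifi.provenance.repository.implementation=org.apache.nifi.provenance.PersistentProvenanceRepository

  # volatile - in-memory only, bounded by nifi.provenance.repository.buffer.size
  nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository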

brett

On Wed, Oct 5, 2016 at 12:42 PM, Joe Witt <joe.w...@gmail.com> wrote:

> NiFi only writes data to disk when it is actually changing the data.
> It is very uncommon to have a 10-processor flow where all or even most
> processors actually touch the content.  You can look at the live
> status history data to see exactly how much content is being read from
> and written to disk.  That makes it very easy to find the heavy users
> of the underlying content repository - and the disk.
>
> Even for reads you should generally benefit from excellent disk/OS
> caching.  Also, even if your flow forks data and sends it down
> multiple paths, it is not actually creating copies of the content -
> just new references to it.  NiFi will also automatically combine
> writes of events to the same file on disk within a short span of time
> and space, which further improves the efficiency of disk utilization.
> The key point is that the content repository is quite efficient at
> this stage.  If you're using a version of NiFi that is years old,
> these things may not be true.
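>
> To illustrate the reference point, here is a rough sketch of what a
> fork looks like at the processor API level (REL_A and REL_B stand in
> for whatever relationships the processor defines):
>
>     // inside onTrigger(ProcessContext context, ProcessSession session)
>     FlowFile original = session.get();
>     if (original == null) {
>         return;
>     }
>     // clone() creates a new FlowFile pointing at the same content
>     // claim - the underlying bytes are not copied
>     FlowFile copy = session.clone(original);
>     session.transfer(original, REL_A);
>     session.transfer(copy, REL_B);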
>
> Now, the run duration suggestion is about efficiency of the flowfile
> repository, which is the bookkeeping of the flowfiles (not the
> content).  We want you to be able to reduce how often the session is
> committed, so run duration lets you choose your tolerance for delay
> while the framework automatically batches sessions together.
>
> So, the key thing to keep in mind is that there are a few repositories
> and other things (depending on your configuration) that will use disk:
> 1) Content repository (the bytes of the things you're reading/writing)
> 2) FlowFile repository (information about the flow files and their
> attributes - no content)
> 3) Provenance Repository
> 4) Logs
>
> All of these can live on different partitions, and each can even be
> spread across multiple partitions.
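>
> For example, in nifi.properties (the paths here are just illustrative):
>
>     nifi.flowfile.repository.directory=/disk1/flowfile_repository
>     nifi.content.repository.directory.default=/disk2/content_repository
>     nifi.provenance.repository.directory.default=/disk3/provenance_repository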
>
> To really help with this particular case, I think we'll need you to
> list out the processors involved (generically if necessary) and how
> much each reads/writes over a five-minute period in steady state.  If
> there really is a chain of 10 processors and most are actually reading
> and writing content, we can talk about additional strategies, such as
> an alternative composition of processors that would be more efficient.
>
> Thanks
> jOe
>
>
> On Wed, Oct 5, 2016 at 11:21 AM, Brett Tiplitz
> <brett.m.tipl...@systolic-inc.com> wrote:
> > I was always trying to understand run duration.  I'm fine on latency,
> > so if it processes a bunch of events at once and my overall throughput
> > stays the same, that's ok.  I increased it to 100 ms.  But looking at
> > the bulk of my flow, this feature was available on only 1 of the more
> > than 10 processors the data goes through.
> >
> > I realize that slowing the rate of commits seems bad, but even the big
> > guys limit commits.
> >
> >
> > On Wed, Oct 5, 2016 at 12:05 PM, Bryan Bende <bbe...@gmail.com> wrote:
> >>
> >> Brett,
> >>
> >> One thing that could possibly improve performance here, although it's
> >> hard to say by how much, is the "Run Duration" setting on the
> >> processor's Scheduling tab. It is only available on processors marked
> >> with the @SupportsBatching annotation, so it depends on which
> >> processors you are using.
> >>
> >> Increasing the run duration lets the framework batch together all of
> >> the framework operations during that time period. The default setting
> >> is 0, which means no batching, giving you the lowest latency per
> >> flowfile, but you can choose to sacrifice some latency for higher
> >> throughput.
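> >>
> >> For reference, a processor opts in to this simply by carrying the
> >> annotation (a minimal sketch, not a complete processor):
> >>
> >>     import org.apache.nifi.annotation.behavior.SupportsBatching;
> >>     import org.apache.nifi.processor.AbstractProcessor;
> >>     import org.apache.nifi.processor.ProcessContext;
> >>     import org.apache.nifi.processor.ProcessSession;
> >>     import org.apache.nifi.processor.exception.ProcessException;
> >>
> >>     @SupportsBatching  // lets the framework defer and batch commits
> >>     public class MyBatchingProcessor extends AbstractProcessor {
> >>         @Override
> >>         public void onTrigger(ProcessContext context,
> >>                 ProcessSession session) throws ProcessException {
> >>             // normal per-flowfile work; session commits may be
> >>             // deferred up to the configured Run Duration
> >>         }
> >>     }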
> >>
> >> I don't know enough about how provenance events are specifically
> >> committed, but I believe they are tied to session commits so that if
> >> a rollback occurs there won't be unwanted events written.
> >>
> >> -Bryan
> >>
> >>
> >> On Wed, Oct 5, 2016 at 11:38 AM, Brett Tiplitz
> >> <brett.m.tipl...@systolic-inc.com> wrote:
> >>>
> >>> James -
> >>>
> >>> I believe the complication for me is both the number of objects and
> >>> the number of processors the data goes through.  I talked with a few
> >>> people, and it sounds like NiFi writes each event out to disk and then
> >>> executes a commit, which really does have a major impact on
> >>> performance.  I don't have the liberty of fixing the disk performance
> >>> itself, though I think I will try moving the journals directory to
> >>> /dev/shm.  I know I'll lose data on a reboot, but that only happens
> >>> maybe 1-2 times a year, so I think that loss is acceptable.  Also, I'm
> >>> not specifying anything about what data gets indexed, so it's whatever
> >>> the default is.
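> >>>
> >>> Concretely, what I had in mind was something like this in
> >>> nifi.properties (untested - I'm not sure the journals can be split
> >>> out on their own, so this would put the whole provenance repository
> >>> on tmpfs):
> >>>
> >>>     nifi.provenance.repository.directory.default=/dev/shm/provenance_repository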
> >>>
> >>> If I'm producing about 6000 events per second (just a guess, though I
> >>> think it's pretty large), it would be nice if there were an option not
> >>> to perform a commit on every one of the 6000 items.  In reality, I
> >>> would say a commit should never occur more than once a second, and
> >>> even that is likely way too often.
> >>>
> >>> Last, is there a way to measure the actual provenance events going
> >>> through?  I'm only guessing at what it's actually doing here.
> >>>
> >>> brett
> >>>
> >>> On Fri, Sep 30, 2016 at 2:16 PM, James Wing <jvw...@gmail.com> wrote:
> >>>>
> >>>> Brett,
> >>>>
> >>>> The default provenance store, PersistentProvenanceRepository, does
> >>>> require I/O in proportion to flowfile events.  Flowfiles with many
> >>>> attributes, especially large attributes, are a frequent contributor
> >>>> to provenance overload because attribute state is tracked in
> >>>> provenance events.  But this is different from flowfile content
> >>>> reads and writes, which use the separate content repository.  You
> >>>> might consider moving the provenance repository to a separate disk
> >>>> for additional I/O capacity.
> >>>>
> >>>> Does this sound relevant?  Can you share some details of your flow
> >>>> volumes and attribute sizes?
> >>>>
> >>>> nifi.provenance.repository.buffer.size is only used by the
> >>>> VolatileProvenanceRepository implementation, an in-memory provenance
> >>>> store.  The property defines the size of the in-memory store.  The
> >>>> volatile store can avoid disk I/O issues, but at the expense of
> >>>> reduced provenance functionality.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> James
> >>>>
> >>>> On Thu, Sep 29, 2016 at 1:37 PM, Brett Tiplitz
> >>>> <brett.m.tipl...@systolic-inc.com> wrote:
> >>>>>
> >>>>> I'm having a throughput problem when processing data with
> >>>>> provenance recording enabled.  I've pretty much disabled it now, so
> >>>>> I believe it is the source of my issue.  On occasion, I get a
> >>>>> message saying the flow is slowing due to provenance recording.  I
> >>>>> was running the out-of-the-box configuration for provenance.
> >>>>>
> >>>>> I believe the issue might be related to commit writes, though it's
> >>>>> just a theory.  There is a property,
> >>>>> nifi.provenance.repository.buffer.size, but I don't see anything
> >>>>> documented about what it does.
> >>>>>
> >>>>> Any suggestions?
> >>>>>
> >>>>> thanks,
> >>>>>
> >>>>> brett
> >>>>>
> >>>>> --
> >>>>> Brett Tiplitz
> >>>>> Systolic, Inc
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Brett Tiplitz
> >>> Systolic, Inc
> >>
> >>
> >
> >
> >
> > --
> > Brett Tiplitz
> > Systolic, Inc
>



-- 
Brett Tiplitz
Systolic, Inc
