Re: NiFi Data Usage via Rest API

2018-07-26 Thread Mike Thomsen
Matt,

Our main use, which provenance data handles well, is figuring out **what**
data was handled. We discard everything but DROP events as a convenience,
because we have no known scenarios where data will be removed before it
reaches the end of the flow.

FWIW, this is what inspired the record stats processor we added about a
release ago. You can use it to drill down into a record set with path
operations and add attributes to a flowfile describing what the record set
therein contains. It should be particularly helpful to anyone who wants to
do partial or full table fetches and then say "count this batch of records
using this path."

Mike

On Wed, Jul 25, 2018 at 9:59 PM Matt Burgess  wrote:

> Mike, Ryan, Boris et al,
>
> I'd like to wrap my head around the kinds of use cases y'all have for
> provenance data in NiFi: what's good, what's bad, what we need to do
> to make things better. Are there questions you want to ask of
> provenance that you can't today? Do the DROP events give you what you
> need for your reporting, or would you benefit from some sort of
> "lifetime record" that might be generated from a NiFi subflow based on
> provenance events? I've been bouncing around the following
> concepts/improvements:
>
> 1) Let the ProcessSession keep track of the duration in a processor
> (for those that don't explicitly report it) [1]
> 2) Add a "durationNanos" field to provenance events, possibly
> replacing "durationMillis" in NiFi 2.0, to give better precision [2]
> 3) A processor to generate lineage when a DROP event is received
> (likely via the SiteToSiteProvenanceReportingTask), dependent on the
> persistence settings of the provenance repository
> 4) A "Query Filter" property on the SiteToSiteProvenanceReportingTask
> (or a separate reporting task if need be), perhaps leveraging Calcite
> for SQL filters or supporting the Lucene query language (since the
> prov events are indexed by Lucene)
>
> I still haven't come up with the New Feature Wiki page for graph tech
> (from a previous discussion on the list) but #3 above lends itself to
> also generating a lineage graph for a FlowFile, in some well-known
> format perhaps (Kryo, GraphML, etc.). I'll try to get that Wiki (and
> the discussion) going soon...
>
> Regards,
> Matt
>
> [1] https://issues.apache.org/jira/browse/NIFI-5420
> [2] https://issues.apache.org/jira/browse/NIFI-5463
> On Wed, Jul 25, 2018 at 9:18 PM Mike Thomsen 
> wrote:
> >
> > Ryan,
> >
> > Understandable. We haven't found a need for Beats or Forwarders here
> either because S2S gives everything you need to reliably ship the data.
> >
> > FWIW, if your need changes, I would recommend stripping down the
> provenance data. We cut out about 66-75% of the fields and dropped the
> intermediate records in favor of keeping only DROP events for our simple
> dashboarding needs, because we figured that if a record never made it
> there, something very bad had happened.
> >
> > On Wed, Jul 25, 2018 at 8:54 PM Ryan H <
> ryan.howell.developm...@gmail.com> wrote:
> >>
> >> Thanks, Mike, for the suggestion. I'm looking for a solution that
> doesn't involve additional components such as
> Beats/Forwarders/Elasticsearch/etc.
> >>
> >> Boris, thanks for the link to the Monitoring introduction--I've
> checked it out multiple times. What I want to avoid is needing anything
> set up on the canvas; I want the metrics collection to happen via the
> REST API. I'm thinking that the API in the original question may be the
> way to go, but I'm unsure without a little more information on the data
> model and how that data is collected/aggregated (i.e., what the returned
> data actually represents). I may just dig into the source if this email
> goes stale.
> >>
> >> -Ryan
> >>
> >>
> >> On Wed, Jul 25, 2018 at 9:17 AM, Boris Tyukin 
> wrote:
> >>>
> >>> Ryan, if you have not seen these posts from Pierre, I suggest starting
> there. He does a good job of explaining the different options:
> >>> https://pierrevillard.com/2017/05/11/monitoring-nifi-introduction/
> >>>
> >>> I do agree that the 5-minute thing is super confusing and pretty
> useless, and you cannot change that interval. I think it is only useful for
> a quick check on your real-time pipelines at the moment.
> >>>
> >>> I wish NiFi provided nicer out-of-the-box logging/monitoring
> capabilities, but on the bright side, it seems that you can build your
> own and customize it as you want.
> >>>
> >>>
> >>> On Tue, Jul 24, 2018 at 10:55 PM Ryan H <
> ryan.howell.developm...@gmail.com> wrote:
> 
>  Hi All,
> 
>  I am looking for a way to obtain the total amount of data that has
> been processed by a running cluster for a period of time, ideally via the
> rest api.
> 
>  Example of my use case:
>  I have, say, 50 different process groups, each with a connection
> to some data source. Each one is continuously pulling data in, doing
> something to it, then sending it out to some other external place. I'd like
> to programmatically gathe

RE: Unable to see Nifi data lineage in Atlas

2018-07-26 Thread Mohit
Hi, 

 

While looking at the logs, I found that ReportLineageToAtlas is not able to
construct a KafkaProducer.

It throws the following stack trace:

 

org.apache.kafka.common.KafkaException: Failed to construct kafka producer
    at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:335)
    at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:188)
    at org.apache.atlas.kafka.KafkaNotification.createProducer(KafkaNotification.java:286)
    at org.apache.atlas.kafka.KafkaNotification.sendInternal(KafkaNotification.java:207)
    at org.apache.atlas.notification.AbstractNotification.send(AbstractNotification.java:84)
    at org.apache.atlas.hook.AtlasHook.notifyEntitiesInternal(AtlasHook.java:133)
    at org.apache.atlas.hook.AtlasHook.notifyEntities(AtlasHook.java:118)
    at org.apache.atlas.hook.AtlasHook.notifyEntities(AtlasHook.java:171)
    at org.apache.nifi.atlas.NiFiAtlasHook.commitMessages(NiFiAtlasHook.java:150)
    at org.apache.nifi.atlas.reporting.ReportLineageToAtlas.lambda$consumeNiFiProvenanceEvents$6(ReportLineageToAtlas.java:721)
    at org.apache.nifi.reporting.util.provenance.ProvenanceEventConsumer.consumeEvents(ProvenanceEventConsumer.java:204)
    at org.apache.nifi.atlas.reporting.ReportLineageToAtlas.consumeNiFiProvenanceEvents(ReportLineageToAtlas.java:712)
    at org.apache.nifi.atlas.reporting.ReportLineageToAtlas.onTrigger(ReportLineageToAtlas.java:664)
    at org.apache.nifi.controller.tasks.ReportingTaskWrapper.run(ReportingTaskWrapper.java:41)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: No enum constant
org.apache.kafka.common.protocol.SecurityProtocol.PLAINTEXTSASL
    at java.lang.Enum.valueOf(Enum.java:238)
    at org.apache.kafka.common.protocol.SecurityProtocol.valueOf(SecurityProtocol.java:28)
    at org.apache.kafka.common.protocol.SecurityProtocol.forName(SecurityProtocol.java:89)
    at org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:79)
    at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:277)
    ... 20 common frames omitted
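
The nested cause suggests the producer is being handed the HDP-specific
protocol name PLAINTEXTSASL, which the Apache Kafka client bundled with the
reporting task does not recognize; it expects SASL_PLAINTEXT. If that is the
case, I suspect the fix is to override the protocol in the
atlas-application.properties that the reporting task reads; something like
the following (property names assumed from Atlas's atlas.kafka.* passthrough
convention, so treat this as a sketch rather than a confirmed fix):

# atlas-application.properties (assumed names; atlas.kafka.* entries are
# passed through to the Kafka producer). Use the Apache Kafka spelling of
# the protocol rather than HDP's PLAINTEXTSASL alias:
atlas.kafka.security.protocol=SASL_PLAINTEXT
atlas.kafka.sasl.kerberos.service.name=kafka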

 

Thanks,

Mohit

 

From: Mohit  
Sent: 25 July 2018 17:46
To: users@nifi.apache.org
Subject: Unable to see Nifi data lineage in Atlas

 

Hi all,

 

I have configured the ReportLineageToAtlas reporting task to send NiFi flow
information to Atlas. NiFi is integrated with Ranger.

I am able to see all the information in Atlas except the lineage. When I
search for hdfs_path or hive_table, I can only see the Hive-side
information. I can't find anything wrong in the configuration.

Is there something in the Ranger configuration that I'm missing?

 

Regards,

Mohit

 

 



Re: NiFi Data Usage via Rest API

2018-07-26 Thread Ryan H
Hi Matt,

The use case that I am investigating is fairly simplistic (and I may be
naive about it). I am only looking for the amount of data that has come into
the cluster (across all PGs) and out of the cluster for a given time period
(or a way to derive it for a time period). I do not want the requirement of
adding reporting tasks, S2S, or anything on the canvas to achieve this. I
want to be able to just query the REST API to get this information. I don't
care about what the data is, only the amount of data that has come in and
gone out.

For example: I would like to call the REST API every 4 hours to see how
much data has come into the cluster and how much data has gone out of the
cluster (500 GB came in via external sources such as data lakes, message
queues, other APIs, etc., and 400 GB went out, e.g. to other data lakes,
other clusters, applications, etc.).

I was thinking that this information was already available, but it is
unclear to me based on the REST API documentation (see original email). Is
this something you can speak to (rest-api:
nifi-api/flow/process-groups/root/status?recursive=true) and the property
values that are returned? With the data model, it is unclear what the values
represent (only the last 5 minutes? a running counter since the start of
time? if it is an aggregate of forever, when will it reset? on cluster
restart?).

Data model in question:
ProcessGroupStatusEntity

{
  "processGroupStatus": {
    ...
    "aggregateSnapshot": {
      ...
      "flowFilesIn": 0,
      "bytesIn": 0,
      "input": "value",
      "flowFilesQueued": 0,
      "bytesQueued": 0,
      "queued": "value",
      "queuedCount": "value",
      "queuedSize": "value",
      "bytesRead": 0,
      "read": "value",
      "bytesWritten": 0,
      "written": "value",
      "flowFilesOut": 0,
      "bytesOut": 0,
      "output": "value",
      "flowFilesTransferred": 0,
      "bytesTransferred": 0,
      "transferred": "value",
      "bytesReceived": 0,       // I think this is the one, but not sure
      "flowFilesReceived": 0,
      "received": "value",
      "bytesSent": 0,           // I think this is the other one, but not sure
      "flowFilesSent": 0,
      "sent": "value",
      "activeThreadCount": 0,
      "terminatedThreadCount": 0
    },
    "nodeSnapshots": [{…}]
  },
  "canRead": true
}
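
For concreteness, here is roughly how I had hoped to poll it, as a minimal
Python sketch against the endpoint above (the base URL is a placeholder,
and authentication is omitted for brevity):

import requests

# Poll the root process group status, aggregated across the cluster.
# NIFI_URL is a placeholder; add tokens/certs as your setup requires.
NIFI_URL = "http://localhost:8080"

resp = requests.get(
    f"{NIFI_URL}/nifi-api/flow/process-groups/root/status",
    params={"recursive": "true"},
    timeout=30,
)
resp.raise_for_status()

snapshot = resp.json()["processGroupStatus"]["aggregateSnapshot"]
print("bytesReceived:", snapshot["bytesReceived"])
print("bytesSent:", snapshot["bytesSent"])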



Thanks,

Ryan H



On Wed, Jul 25, 2018 at 9:59 PM, Matt Burgess  wrote:

> Mike, Ryan, Boris et al,
>
> I'd like to wrap my head around the kinds of use cases y'all have for
> provenance data in NiFi: what's good, what's bad, what we need to do
> to make things better. Are there questions you want to ask of
> provenance that you can't today? Do the DROP events give you what you
> need for your reporting, or would you benefit from some sort of
> "lifetime record" that might be generated from a NiFi subflow based on
> provenance events? I've been bouncing around the following
> concepts/improvements:
>
> 1) Let the ProcessSession keep track of the duration in a processor
> (for those that don't explicitly report it) [1]
> 2) Add a "durationNanos" field to provenance events, possibly
> replacing "durationMillis" in NiFi 2.0, to give better precision [2]
> 3) A processor to generate lineage when a DROP event is received
> (likely via the SiteToSiteProvenanceReportingTask), dependent on the
> persistence settings of the provenance repository
> 4) A "Query Filter" property on the SiteToSiteProvenanceReportingTask
> (or a separate reporting task if need be), perhaps leveraging Calcite
> for SQL filters or supporting the Lucene query language (since the
> prov events are indexed by Lucene)
>
> I still haven't come up with the New Feature Wiki page for graph tech
> (from a previous discussion on the list) but #3 above lends itself to
> also generating a lineage graph for a FlowFile, in some well-known
> format perhaps (Kryo, GraphML, etc.). I'll try to get that Wiki (and
> the discussion) going soon...
>
> Regards,
> Matt
>
> [1] https://issues.apache.org/jira/browse/NIFI-5420
> [2] https://issues.apache.org/jira/browse/NIFI-5463
> On Wed, Jul 25, 2018 at 9:18 PM Mike Thomsen 
> wrote:
> >
> > Ryan,
> >
> > Understandable. We haven't found a need for Beats or Forwarders here
> either because S2S gives everything you need to reliably ship the data.
> >
> > FWIW, if your need changes, I would recommend stripping down the
> provenance data. We cut out about 66-75% of the fields and dropped the
> intermediate records in favor of keeping only DROP events for our simple
> dashboarding needs, because we figured that if a record never made it
> there, something very bad had happened.
> >
> > On Wed, Jul 25, 2018 at 8:54 PM Ryan H <ryan.howell.developm...@gmail.com> wrote:
> >>
> >> Thanks, Mike, for the suggestion. I'm looking for a solution that
> doesn't involve additional components such as
> Beats/Forwarders/Elasticsearch/etc.
> >>
> >> Boris, thanks for the link for the Monitoring introduction--I've
> checked it out multiple times. What I want to avoid is having the need for
> anything to be se

High CPU load upon starting non-connected Out Port inside PG

2018-07-26 Thread Ken Tore Tallakstad
Hi,

First, thanks a lot for a great product! :)

My issue is this: create a PG, inside it create an out-port, and connect it
to another out-port outside the PG. Start the out-port inside the PG. My
CPU load then sky-rockets (from ~5-10% to 200-300% on my laptop, and to
500-1000% on my servers) :/
If I instead connect a processor, running or not (e.g. the FlowFile
Generator), to the out-port inside the PG, CPU load returns to "normal".
Also, if I just stop the running out-port inside the PG, with nothing
connected on the input side, all is normal.

Have not gotten around to looking at the thread-dump yet.

I've tested this on both a clustered and a clean standalone version of NiFi
1.7.1 (inside a Docker container, that is, but as far as I can tell this
does not matter). I'm on CentOS 7.4 with Java 1.8_144.

Can anyone recreate this?

Cheers,

KT :)


Re: NiFi Data Usage via Rest API

2018-07-26 Thread Mark Payne
Hey Ryan,

The stats that you are seeing here reflect a rolling 5-minute window. The 
"bytesReceived" value indicates the number of bytes that were received from 
external systems (i.e., the number of bytes reported by Provenance RECEIVE 
events). The "bytesSent" value indicates the number of bytes that were sent to 
external systems (i.e., the number of bytes reported by Provenance SEND 
events). If you were to receive a file (or a datagram or a packet or a message 
or whatever) that is, say, 10 MB, then send it to two destinations, you'd see 
a "bytesReceived" of 10,000,000 and a "bytesSent" of 20,000,000 (because the 
10 MB were sent twice).

Because these are rolling windows, not tumbling windows, it would be difficult 
to use them to get exact counts over long periods of time. I do think the 
Provenance Data is the right place to look for that kind of information, but it 
may not be as trivial as you'd hope, by hitting a REST API.
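
To sketch what I mean: once the provenance events are off the box (e.g., as
the JSON the SiteToSiteProvenanceReportingTask emits), the tally itself is
simple. A minimal Python sketch follows; the "eventType" and "entitySize"
field names are my recollection of that JSON format, so treat them as
assumptions:

import json

def tally_transfer(events_json):
    """Sum bytes received from / sent to external systems.

    Assumes a JSON array of provenance event objects carrying an
    "eventType" field (RECEIVE, SEND, DROP, ...) and an "entitySize"
    field with the FlowFile size in bytes.
    """
    received = sent = 0
    for event in json.loads(events_json):
        size = event.get("entitySize") or 0
        if event["eventType"] == "RECEIVE":
            received += size
        elif event["eventType"] == "SEND":
            sent += size
    return received, sent

# The 10 MB file received once and sent to two destinations, as above:
batch = """[
  {"eventType": "RECEIVE", "entitySize": 10000000},
  {"eventType": "SEND", "entitySize": 10000000},
  {"eventType": "SEND", "entitySize": 10000000}
]"""
print(tally_transfer(batch))  # -> (10000000, 20000000)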

At first thought, it may make sense to have another endpoint that does return 
counts for all data received/sent/etc. since the instance was started. This 
would make it much easier to query every 4 hours, for instance, and then 
subtract the two numbers to get the difference. This, however, would still be 
problematic if a node is restarted: if you get the information at time 
t = 4 hours, a node restarts at t = 7 hours, and you get the next snapshot at 
t = 8 hours, you'll only have 1 hour's worth of data from the node that 
restarted... this is one of the benefits of gleaning this info from 
Provenance data.

Alternatively, I suppose, some sort of persistent reporting mechanism could be 
built within NiFi itself to track this sort of thing, but nothing like that 
exists at the moment.

Thanks
-Mark

On Jul 26, 2018, at 9:12 AM, Ryan H <ryan.howell.developm...@gmail.com> wrote:

Hi Matt,

The use case that I am investigating is fairly simplistic (and I may be naive 
about it). I am only looking for the amount of data that has come into the 
cluster (across all PGs) and out of the cluster for a given time period (or a 
way to derive it for a time period). I do not want the requirement of adding 
reporting tasks, S2S, or anything on the canvas to achieve this. I want to be 
able to just query the REST API to get this information. I don't care about 
what the data is, only the amount of data that has come in and gone out.

For example: I would like to call the REST API every 4 hours to see how much 
data has come into the cluster and how much data has gone out of the cluster 
(500 GB came in via external sources such as data lakes, message queues, other 
APIs, etc., and 400 GB went out, e.g. to other data lakes, other clusters, 
applications, etc.).

I was thinking that this information was already available, but it is unclear 
to me based on the REST API documentation (see original email). Is this 
something you can speak to (rest-api: 
nifi-api/flow/process-groups/root/status?recursive=true) and the property 
values that are returned? With the data model, it is unclear what the values 
represent (only the last 5 minutes? a running counter since the start of time? 
if it is an aggregate of forever, when will it reset? on cluster restart?).

Data model in question:
ProcessGroupStatusEntity

{
  "processGroupStatus": {
    ...
    "aggregateSnapshot": {
      ...
      "flowFilesIn": 0,
      "bytesIn": 0,
      "input": "value",
      "flowFilesQueued": 0,
      "bytesQueued": 0,
      "queued": "value",
      "queuedCount": "value",
      "queuedSize": "value",
      "bytesRead": 0,
      "read": "value",
      "bytesWritten": 0,
      "written": "value",
      "flowFilesOut": 0,
      "bytesOut": 0,
      "output": "value",
      "flowFilesTransferred": 0,
      "bytesTransferred": 0,
      "transferred": "value",
      "bytesReceived": 0,       // I think this is the one, but not sure
      "flowFilesReceived": 0,
      "received": "value",
      "bytesSent": 0,           // I think this is the other one, but not sure
      "flowFilesSent": 0,
      "sent": "value",
      "activeThreadCount": 0,
      "terminatedThreadCount": 0
    },
    "nodeSnapshots": [{…}]
  },
  "canRead": true
}



Thanks,

Ryan H



On Wed, Jul 25, 2018 at 9:59 PM, Matt Burgess <mattyb...@apache.org> wrote:
Mike, Ryan, Boris et al,

I'd like to wrap my head around the kinds of use cases y'all have for
provenance data in NiFi: what's good, what's bad, what we need to do
to make things better. Are there questions you want to ask of
provenance that you can't today? Do the DROP events give you what you
need for your reporting, or would you benefit from some sort of
"lifetime record" that might be generated from a NiFi subflow based on
provenance events? I've been bouncing around the following
concepts/improvements:

1) Let the ProcessSession keep track of the duration in a processor
(for those that don't explicitly report it) [1]
2) Add a "durationNanos" field to provenance events, possibly
replacing "durationMillis" in NiFi 2.0,

Re: High CPU load upon starting non-connected Out Port inside PG

2018-07-26 Thread Mark Payne
KT,

I can confirm that this is the behavior I'm seeing as well. I went ahead and 
created a JIRA [1]
for this. I think the bug really is in the fact that we allow you to start the 
Port at all. Just like some
Processors are annotated as Requiring Input in order to be valid, ports should 
be too (unless they
are at the root group).

Thanks!
-Mark


[1] https://issues.apache.org/jira/browse/NIFI-5464

On Jul 26, 2018, at 9:22 AM, Ken Tore Tallakstad <tallaks...@gmail.com> wrote:

Hi,

First, thanks a lot for a great product! :)

My issue is this: create a PG, inside it create an out-port, and connect it to 
another out-port outside the PG. Start the out-port inside the PG. My CPU load 
then sky-rockets (from ~5-10% to 200-300% on my laptop, and to 500-1000% on my 
servers) :/
If I instead connect a processor, running or not (e.g. the FlowFile 
Generator), to the out-port inside the PG, CPU load returns to "normal". Also, 
if I just stop the running out-port inside the PG, with nothing connected on 
the input side, all is normal.

Have not gotten around to looking at the thread-dump yet.

I've tested this on both a clustered and a clean standalone version of NiFi 
1.7.1 (inside a Docker container, that is, but as far as I can tell this does 
not matter). I'm on CentOS 7.4 with Java 1.8_144.

Can anyone recreate this?

Cheers,

KT :)



Re: NiFi Data Usage via Rest API

2018-07-26 Thread Ryan H
Hi Mark,

Thanks for the explanation on this; this is what I was looking for. So it
sounds like Provenance info is the way to go (as mentioned by Mike [thanks
Mike]). I will have to do a little more research on the Provenance events,
but it sounds like RECEIVE events are for when something is coming into the
system from an external source and SEND events are for when it is leaving
the system (I hope this is a correct assumption). I think I will mull this
over for a bit.


-Ryan

On Thu, Jul 26, 2018 at 9:57 AM, Mark Payne  wrote:

> Hey Ryan,
>
> The stats that you are seeing here reflect a rolling 5-minute window. The
> "bytesReceived" indicates the number of bytes that were received from
> external systems (i.e., the number of bytes reported as Provenance RECEIVE
> events). The "bytesSent' indicates the number of bytes that were sent to
> external systems (i.e., the number of bytes reported as Provenance SEND
> events). If you were to receive a file (or a datagram or a packet or a
> message or whatever) that is say 10 MB, then send it to two destinations,
> then you'd see "bytesReceived" of 10,000,000 and "bytesSent" of 20,000,000
> (because the 10 MB were sent twice).
>
> Because these are rolling windows, not tumbling windows, it would be
> difficult to use them to get exact counts over long periods of time. I do
> think the Provenance Data is the right place to look for that kind of
> information, but it may not be as trivial as you'd hope, by hitting a REST
> API.
>
> At first thought, it may make sense to have another endpoint that does
> return counts for all data received/sent/etc. since the instance was
> started. This would make it much easier to query every 4 hours, for
> instance, and then subtract the two numbers to get the difference. This,
> however, would still be problematic if a node is restarted, because if you
> get the information at time t = 4 hours, then a node restarts at time t = 7
> hours, and then you get the next snapshot at time t = 8 hours, you'll only
> have 1 hour's worth of data from the node that restarted... this is one of
> the benefits of gleaning this info from Provenance data.
>
> Alternatively, I suppose, some sort of persistent reporting mechanism
> could be built within NiFi itself to track this sort of thing, but nothing
> like that exists at the moment.
>
> Thanks
> -Mark
>
>
> On Jul 26, 2018, at 9:12 AM, Ryan H 
> wrote:
>
> Hi Matt,
>
> The use case that I am investigating is fairly simplistic (and I may be
> naive about it). I am only looking for the amount of data that has come
> into the cluster (across all PGs) and out of the cluster for a given time
> period (or a way to derive based on a time period). I do not want to have
> the requirement of adding reporting tasks, S2S, or anything on the canvas
> to achieve this. I want to be able to just query the rest api to get this
> information. I don't care about what the data is, only the amount of data
> that has come in and gone out.
>
> For example: I would like to call the REST API every 4 hours to see how
> much data has come into the cluster and how much data has gone out of the
> cluster (500 GB came in via external sources such as data lakes, message
> queues, other api's, etc and 400 GB went out such as to other data lakes,
> other clusters, applications, etc).
>
> I was thinking that information was already available, but it is unclear
> to me based on the rest api documentation (see original email). Is this
> something you can speak to (rest-api: nifi-api/flow/process-groups/
> root/status?recursive=true) and the property values that are returned?
> With the data model, it is unclear on what the values represent (only for
> last 5 minutes? running counter since the start of time? if it is an
> aggregate of forever, when will it reset? cluster restart?).
>
> Data model in question:
> ProcessGroupStatusEntity
>
> {
>   "processGroupStatus": {
>     ...
>     "aggregateSnapshot": {
>       ...
>       "flowFilesIn": 0,
>       "bytesIn": 0,
>       "input": "value",
>       "flowFilesQueued": 0,
>       "bytesQueued": 0,
>       "queued": "value",
>       "queuedCount": "value",
>       "queuedSize": "value",
>       "bytesRead": 0,
>       "read": "value",
>       "bytesWritten": 0,
>       "written": "value",
>       "flowFilesOut": 0,
>       "bytesOut": 0,
>       "output": "value",
>       "flowFilesTransferred": 0,
>       "bytesTransferred": 0,
>       "transferred": "value",
>       "bytesReceived": 0,       // I think this is the one, but not sure
>       "flowFilesReceived": 0,
>       "received": "value",
>       "bytesSent": 0,           // I think this is the other one, but not sure
>       "flowFilesSent": 0,
>       "sent": "value",
>       "activeThreadCount": 0,
>       "terminatedThreadCount": 0
>     },
>     "nodeSnapshots": [{…}]
>   },
>   "canRead": true
> }
>
>
>
> Thanks,
>
> Ryan H
>
>
>
> On Wed, Jul 25, 2018 at 9:59 PM, Matt Burgess 
> wrote:
>
>> Mike, Ryan, Boris et al,
>>
>> I'd like to wrap my head around the kinds of u

Re: NiFi Data Usage via Rest API

2018-07-26 Thread Mark Payne
Ryan,

That is correct. Would just clarify that when you say "SEND events are when 
they are leaving the system" -- the data is being
sent to an external system, but it is not being dropped from NiFi. So you could 
send the data to 10 different places. A "DROP"
event indicates that NiFi is now finished processing it.

Thanks
-Mark

On Jul 26, 2018, at 10:25 AM, Ryan H <ryan.howell.developm...@gmail.com> wrote:

Hi Mark,

Thanks for the explanation on this; this is what I was looking for. So it 
sounds like Provenance info is the way to go (as mentioned by Mike [thanks 
Mike]). I will have to do a little more research on the Provenance events, but 
it sounds like RECEIVE events are for when something is coming into the system 
from an external source and SEND events are for when it is leaving the system 
(I hope this is a correct assumption). I think I will mull this over for a 
bit.


-Ryan

On Thu, Jul 26, 2018 at 9:57 AM, Mark Payne <marka...@hotmail.com> wrote:
Hey Ryan,

The stats that you are seeing here reflect a rolling 5-minute window. The 
"bytesReceived" value indicates the number of bytes that were received from 
external systems (i.e., the number of bytes reported by Provenance RECEIVE 
events). The "bytesSent" value indicates the number of bytes that were sent to 
external systems (i.e., the number of bytes reported by Provenance SEND 
events). If you were to receive a file (or a datagram or a packet or a message 
or whatever) that is, say, 10 MB, then send it to two destinations, you'd see 
a "bytesReceived" of 10,000,000 and a "bytesSent" of 20,000,000 (because the 
10 MB were sent twice).

Because these are rolling windows, not tumbling windows, it would be difficult 
to use them to get exact counts over long periods of time. I do think the 
Provenance Data is the right place to look for that kind of information, but it 
may not be as trivial as you'd hope, by hitting a REST API.

At first thought, it may make sense to have another endpoint that does return 
counts for all data received/sent/etc. since the instance was started. This 
would make it much easier to query every 4 hours, for instance, and then 
subtract the two numbers to get the difference. This, however, would still be 
problematic if a node is restarted, because if you get the information at time 
t = 4 hours, then a node restarts at time t = 7 hours, and then you get the 
next snapshot at time t = 8 hours, you'll only have 1 hour's worth of data from 
the node that restarted... this is one of the benefits of gleaning this info 
from Provenance data.

Alternatively, I suppose, some sort of persistent reporting mechanism could be 
built within NiFi itself to track this sort of thing, but nothing like that 
exists at the moment.

Thanks
-Mark


On Jul 26, 2018, at 9:12 AM, Ryan H 
mailto:ryan.howell.developm...@gmail.com>> 
wrote:

Hi Matt,

The use case that I am investigating is fairly simplistic (and I may be naive 
about it). I am only looking for the amount of data that has come into the 
cluster (across all PGs) and out of the cluster for a given time period (or a 
way to derive based on a time period). I do not want to have the requirement of 
adding reporting tasks, S2S, or anything on the canvas to achieve this. I want 
to be able to just query the rest api to get this information. I don't care 
about what the data is, only the amount of data that has come in and gone out.

For example: I would like to call the REST API every 4 hours to see how much 
data has come into the cluster and how much data has gone out of the cluster 
(500 GB came in via external sources such as data lakes, message queues, other 
api's, etc and 400 GB went out such as to other data lakes, other clusters, 
applications, etc).

I was thinking that information was already available, but it is unclear to me 
based on the rest api documentation (see original email). Is this something you 
can speak to (rest-api: 
nifi-api/flow/process-groups/root/status?recursive=true) and the property 
values that are returned? With the data model, it is unclear on what the values 
represent (only for last 5 minutes? running counter since the start of time? if 
it is an aggregate of forever, when will it reset? cluster restart?).

Data model in question:
ProcessGroupStatusEntity

{
  "processGroupStatus": {
    ...
    "aggregateSnapshot": {
      ...
      "flowFilesIn": 0,
      "bytesIn": 0,
      "input": "value",
      "flowFilesQueued": 0,
      "bytesQueued": 0,
      "queued": "value",
      "queuedCount": "value",
      "queuedSize": "value",
      "bytesRead": 0,
      "read": "value",
      "bytesWritten": 0,
      "written": "value",
      "flowFilesOut": 0,
      "bytesOut": 0,
      "output": "value",
      "flowFilesTransferred": 0,
      "bytesTransferred": 0,
      "transferred": "value",
      "bytesReceived": 0,       // I think this is the one, but not sure
      "flowFilesReceived": 0,
      "received": "value",
      "bytesSent": 0,           // I think this is the other one, but not sur

nifi-hive-nar-1.7.1.nar will not load

2018-07-26 Thread geoff.craig
Hello,

I did a clean install of NiFi 1.7.1 and nifi-hive-nar-1.7.1.nar will not
load.  It throws a Java error.  I had to remove it from the folder to get
NiFi running.  Is this a known issue?



--
Sent from: http://apache-nifi-users-list.2361937.n4.nabble.com/


Re: nifi-hive-nar-1.7.1.nar will not load

2018-07-26 Thread Sivaprasanna
Geoff,

What error did you get? Please share it with us here.


On Thu, Jul 26, 2018 at 11:38 PM geoff.craig  wrote:

> Hello,
>
> I did a clean install of NiFi 1.7.1 and nifi-hive-nar-1.7.1.nar will not
> load.  It throws a Java error.  I had to remove it from the folder to get
> NiFi running.  Is this a known issue?
>
>
>
> --
> Sent from: http://apache-nifi-users-list.2361937.n4.nabble.com/
>


Re: nifi-hive-nar-1.7.1.nar will not load

2018-07-26 Thread Joe Witt
Geoff

Don't think it is a known issue.  Many of us tested the
build/startup/etc.  Can you share the log output?

Thanks

On Thu, Jul 26, 2018 at 2:08 PM, geoff.craig  wrote:
> Hello,
>
> I did a clean install of NiFi 1.7.1 and nifi-hive-nar-1.7.1.nar will not
> load.  It throws a Java error.  I had to remove it from the folder to get
> NiFi running.  Is this a known issue?
>
>
>
> --
> Sent from: http://apache-nifi-users-list.2361937.n4.nabble.com/


Re: nifi-hive-nar-1.7.1.nar will not load

2018-07-26 Thread geoff.craig
Here is the error:

2018-07-26 18:48:49,013 ERROR [main] org.apache.nifi.NiFi Failure to launch
NiFi due to java.util.ServiceConfigurationError:
org.apache.nifi.processor.Processor: Provider
org.apache.nifi.processors.hive.PutHiveStreaming could not be instantiated
java.util.ServiceConfigurationError: org.apache.nifi.processor.Processor:
Provider org.apache.nifi.processors.hive.PutHiveStreaming could not be
instantiated
    at java.util.ServiceLoader.fail(ServiceLoader.java:232)
    at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
    at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
    at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
    at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
    at org.apache.nifi.nar.ExtensionManager.loadExtensions(ExtensionManager.java:148)
    at org.apache.nifi.nar.ExtensionManager.discoverExtensions(ExtensionManager.java:123)
    at org.apache.nifi.web.server.JettyServer.start(JettyServer.java:832)
    at org.apache.nifi.NiFi.<init>(NiFi.java:157)
    at org.apache.nifi.NiFi.<init>(NiFi.java:71)
    at org.apache.nifi.NiFi.main(NiFi.java:292)
Caused by: org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null
    at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:239)
    at org.xerial.snappy.Snappy.<clinit>(Snappy.java:48)
    at org.apache.nifi.processors.hive.PutHiveStreaming.<clinit>(PutHiveStreaming.java:152)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at java.lang.Class.newInstance(Class.java:442)
    at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
    ... 8 common frames omitted




--
Sent from: http://apache-nifi-users-list.2361937.n4.nabble.com/


Re: High CPU load upon starting non-connected Out Port inside PG

2018-07-26 Thread Ken Tore Tallakstad
Thanks, good to know! We had a rather complex flow and took us a while to
figure this one out :)

best

KT

On Thu, 26 Jul 2018 at 16:16, Mark Payne wrote:

> KT,
>
> I can confirm that this is the behavior I'm seeing as well. I went ahead
> and created a JIRA [1]
> for this. I think the bug really is in the fact that we allow you to start
> the Port at all. Just like some
> Processors are annotated as Requiring Input in order to be valid, ports
> should be too (unless they
> are at the root group).
>
> Thanks!
> -Mark
>
>
> [1] https://issues.apache.org/jira/browse/NIFI-5464
>
> On Jul 26, 2018, at 9:22 AM, Ken Tore Tallakstad 
> wrote:
>
> Hi,
>
> First, thanks a lot for a great product! :)
>
> My issue is this: create a PG, inside it create an out-port, and connect it
> to another out-port outside the PG. Start the out-port inside the PG. My
> CPU load then sky-rockets (from ~5-10% to 200-300% on my laptop, and to
> 500-1000% on my servers) :/
> If I instead connect a processor, running or not (e.g. the FlowFile
> Generator), to the out-port inside the PG, CPU load returns to "normal".
> Also, if I just stop the running out-port inside the PG, with nothing
> connected on the input side, all is normal.
>
> Have not gotten around to looking at the thread-dump yet.
>
> I've tested this on both a clustered and a clean standalone version of NiFi
> 1.7.1 (inside a Docker container, that is, but as far as I can tell, this
> does not matter). I'm on CentOS 7.4 with Java 1.8_144.
>
> Can anyone recreate this?
>
> Cheers,
>
> KT :)
>
>
>


Hive w/ Kerberos Authentication starts failing after a week

2018-07-26 Thread Peter Wicks (pwicks)
We are seeing frequent failures of our Hive DBCP connections after a week of 
use when using Kerberos with a principal/keytab. We've tried both with the 
Credential Service and without (though, looking at the code, there should be 
no difference).

It looks like the tickets are expiring and renewal is not happening?

javax.security.sasl.SaslException: GSS initiate failed
    at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
    at org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
    at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
    at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
    at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
    at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
    at org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
    at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:204)
    at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:176)
    at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
    at org.apache.commons.dbcp.DriverConnectionFactory.createConnection(DriverConnectionFactory.java:38)
    at org.apache.commons.dbcp.PoolableConnectionFactory.makeObject(PoolableConnectionFactory.java:582)
    at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:1148)
    at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:106)
    at org.apache.commons.dbcp.BasicDataSource.getConnection(BasicDataSource.java:1044)
    at org.apache.nifi.dbcp.hive.HiveConnectionPool.lambda$getConnection$0(HiveConnectionPool.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
    at org.apache.nifi.dbcp.hive.HiveConnectionPool.getConnection(HiveConnectionPool.java:355)
    at sun.reflect.GeneratedMethodAccessor515.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)

Thanks,
Peter


Re: nifi-hive-nar-1.7.1.nar will not load

2018-07-26 Thread Matt Burgess
This has been biting a few users lately; not sure when it changed
exactly, but the Hive NAR uses a version of Snappy that tries to
extract the native Snappy library into a location pointed to by the
"java.io.tmpdir" system property, which IIRC is /tmp/. The /tmp
directory sometimes has a noexec restriction on it, and/or the OS user
running NiFi does not have permission to read/write/execute in that
directory. I haven't tried this workaround but I believe it has worked
for other folks: add a line under your other "java.arg.X" lines in
bootstrap.conf, such as:

java.arg.snappy=-Dorg.xerial.snappy.tempdir=/path/to/nifi/lib/

Where the directory is the full path to NiFi's lib/ directory. This
will cause the native Snappy library to be extracted there, but since
the Snappy Java class is the only one attempting to load the library,
it shouldn't cause any issues by being there.

Have other folks run into this? I wonder if we should just add this
argument to bootstrap.conf to avoid any potential issues, but of
course we'd want to make sure that it doesn't introduce any issues
either.
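
For clarity, the placement would look something like the following excerpt
(the neighboring lines and the install path are just illustrative):

# bootstrap.conf -- illustrative excerpt; /opt/nifi is an assumed install path
java.arg.2=-Xms512m
java.arg.3=-Xmx512m
# Extract Snappy's native library into NiFi's lib/ directory instead of a
# possibly-noexec /tmp
java.arg.snappy=-Dorg.xerial.snappy.tempdir=/opt/nifi/lib/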

Regards,
Matt
On Thu, Jul 26, 2018 at 2:49 PM geoff.craig  wrote:
>
> Here is the error:
>
> 2018-07-26 18:48:49,013 ERROR [main] org.apache.nifi.NiFi Failure to launch
> NiFi due to java.util.ServiceConfigurationError:
> org.apache.nifi.processor.Processor: Provider
> org.apache.nifi.processors.hive.PutHiveStreaming could not be instantiated
> java.util.ServiceConfigurationError: org.apache.nifi.processor.Processor:
> Provider org.apache.nifi.processors.hive.PutHiveStreaming could not be
> instantiated
>     at java.util.ServiceLoader.fail(ServiceLoader.java:232)
>     at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
>     at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
>     at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
>     at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
>     at org.apache.nifi.nar.ExtensionManager.loadExtensions(ExtensionManager.java:148)
>     at org.apache.nifi.nar.ExtensionManager.discoverExtensions(ExtensionManager.java:123)
>     at org.apache.nifi.web.server.JettyServer.start(JettyServer.java:832)
>     at org.apache.nifi.NiFi.<init>(NiFi.java:157)
>     at org.apache.nifi.NiFi.<init>(NiFi.java:71)
>     at org.apache.nifi.NiFi.main(NiFi.java:292)
> Caused by: org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null
>     at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:239)
>     at org.xerial.snappy.Snappy.<clinit>(Snappy.java:48)
>     at org.apache.nifi.processors.hive.PutHiveStreaming.<clinit>(PutHiveStreaming.java:152)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at java.lang.Class.newInstance(Class.java:442)
>     at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
>     ... 8 common frames omitted
>
>
>
>
> --
> Sent from: http://apache-nifi-users-list.2361937.n4.nabble.com/