Re: Nifi Publish/Consumer Kafka and Azure Event Hub

2018-09-05 Thread James Srinivasan
I've not tried this myself, but once you have a working JAAS config
(from
https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-quickstart-kafka-enabled-event-hubs#send-and-receive-messages-with-kafka-in-event-hubs),
set the corresponding protocol and mechanism properties in the NiFi
processor, put the content of sasl.jaas.config in a file, and reference
that file from NiFi's bootstrap.conf as indicated by the second link you
found.

Good luck!
On Wed, 5 Sep 2018 at 15:54, João Henrique Freitas  wrote:
>
>
> Hello!
>
> I'm exploring Azure Event Hub with Kafka support. I know that's in preview.
>
> But I would like to know how to use PublishKafka with this configuration:
>
> https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-quickstart-kafka-enabled-event-hubs
>
> I don't know how to configure kafka processor with authentication parameters 
> like:
>
> security.protocol=SASL_SSL
> sasl.mechanism=PLAIN
> sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule 
> required username="$ConnectionString" 
> password="{YOUR.EVENTHUBS.CONNECTION.STRING}";
>
>
> Should I follow this?
>
> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-kafka-1-0-nar/1.7.1/org.apache.nifi.processors.kafka.pubsub.PublishKafka_1_0/additionalDetails.html
>
> Or maybe it's necessary to do a patch in the Publish/Consume Kafka processors?
>
> Best regards,
>
> --
> João Henrique Ferreira de Freitas - joaohf_at_gmail.com
> Campinas-SP-Brasil


Demuxing NiFi logs

2018-09-07 Thread James Srinivasan
As we add more and more different processor groups doing separate things,
our nifi-app log is getting rather unmanageable. I'm probably missing
something obvious, but is one log per group possible?

Thanks


Re: Demuxing NiFi logs

2018-09-07 Thread James Srinivasan
Sure, we have several data flow managers who are each responsible for a
number of separate data flows, each of which lives in a separate processor
group. When debugging, being able to see what's happening in your processor
group (including full stack traces) is essential, but the noise from all
the other processors is hugely distracting.

On Fri, 7 Sep 2018, 19:52 Joe Witt,  wrote:

> James,
>
> One log per process group is not currently supported.  Through things
> like the message diagnostic context it might be pretty doable.
>
> That said, the logs aren't really meant to be the true historical
> record of value.  The data provenance entries are. Can you share more
> about what you're trying to accomplish?
>
> Thanks
> On Fri, Sep 7, 2018 at 2:47 PM James Srinivasan
>  wrote:
> >
> > As we add more and more different processor groups doing separate
> things, our nifi-app log is getting rather unmanageable. I'm probably
> missing something obvious, but is one log per group possible?
> >
> > Thanks
>


Re: Demuxing NiFi logs

2018-09-07 Thread James Srinivasan
You can already correlate logs per processor using the id, but that's a bit
clunky, especially with multi-line messages (e.g. stack traces). It might be
possible to add the processor group id to the logging context, I guess.
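
(By clunky I mean we end up doing something like this by hand - the
processor UUID is made up, and the -A is a crude way to catch multi-line
stack traces:)

  grep 'a1b2c3d4-0123-1045-abcd-ef0123456789' logs/nifi-app.log
  grep -A 20 'a1b2c3d4-0123-1045-abcd-ef0123456789' logs/nifi-app.log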

Another use case is monitoring production flows for intermittent errors
when the bulletins might be missed. Provenance tells us what happened, but
the logs tell us why it happened (though some processors populate extra
attributes). All too often we find the failure connections are
automatically terminated :-(

I've been putting off integrating NiFi into our ELK stack, but maybe I
ought to do that anyway.

I'll try to write up a jira.

On Fri, 7 Sep 2018, 20:14 Joe Witt,  wrote:

> Yeah makes sense.  We could either make the process group specific
> logs an option, or make tag based logs an option, or we could perhaps
> make bulletins something we store/make accessible for a longer period
> of time.  With the first two it is probably not a lot of code changes
> but then raw logs the user has to correlate to specific components in
> the flow.  The bulletin path is probably a good bit more code though a
> better user/operator experience.
>
> But your point/ask makes sense.
> On Fri, Sep 7, 2018 at 3:04 PM James Srinivasan
>  wrote:
> >
> > Sure, we have several data flow managers who are each responsible for a
> number of separate data flows, each of which lives in a separate processor
> group. When debugging, being able to see what's happening in your processor
> group (including full stack traces) is essential, but the noise from all
> the other processors is hugely distracting.
> >
> > On Fri, 7 Sep 2018, 19:52 Joe Witt,  wrote:
> >>
> >> James,
> >>
> >> One log per process group is not currently supported.  Through things
> >> like the message diagnostic context it might be pretty doable.
> >>
> >> That said, the logs aren't really meant to be the true historical
> >> record of value.  The data provenance entries are. Can you share more
> >> about what you're trying to accomplish?
> >>
> >> Thanks
> >> On Fri, Sep 7, 2018 at 2:47 PM James Srinivasan
> >>  wrote:
> >> >
> >> > As we add more and more different processor groups doing separate
> things, our nifi-app log is getting rather unmanageable. I'm probably
> missing something obvious, but is one log per group possible?
> >> >
> >> > Thanks
>


Generating Remote URL for InvokeHTTP

2018-11-16 Thread James Srinivasan
Hi all,

I'm observing some slightly unusual behaviour with my flow and wanted
to run a possible explanation past the list. I'm using NiFi to scrape
a website consisting of nested data

e.g. GET http://server/2018/16/11/  returns a webpage full of links to
today's data

I'm using a combination of InvokeHTTP (to traverse the hierarchy) and
GetHTMLElement (to extract file and directory links), starting at the
root i.e. http://server/, then walking the years, months, days etc.

I'm generating the Remote URLs as

${invokehttp.request.url}${HTMLElement}

where invokehttp.request.url is the URL previously fetched for the day
listing in the hierarchy, and HTMLElement is the link to the file
extracted by GetHTMLElement.

Finally, I've routed "retry" and "failure" back to the InvokeHTTP
processor since my network is quite flaky.

Mostly everything is ok, but sometimes I manage to generate URLs which
look a bit like this:

http://server/2018/16/11/filename.jsonfilename.json

i.e. the filename part of the URL is duplicated

My thesis is that this is occurring when there is a network issue, so
the flowfile is routed to retry, then the InvokeHTTP processor
re-evaluates the expression for the Remote URL which leads to the
duplication of the filename (since invokehttp.request.url will have
been updated by the failed request).

Does this sound feasible? My proposed fix is to set a single URL attribute
via UpdateAttribute before InvokeHTTP and use that attribute as the Remote
URL, so that any retries don't munge the URL.
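
i.e. something along these lines, where target.url is just an example
attribute name:

  UpdateAttribute:  target.url = ${invokehttp.request.url}${HTMLElement}
  InvokeHTTP:       Remote URL = ${target.url}

That way the expression is only evaluated once, before the first attempt,
and a retry just re-reads the already-set attribute.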

Many thanks, hope this makes sense.

James


Re: stop processing related flowfiles

2018-11-28 Thread James Srinivasan
Hopefully you already know this:

"Kafka only provides a total order over records *within* a partition, not
between different partitions in a topic. Per-partition ordering combined
with the ability to partition data by key is sufficient for most
applications. However, if you require a total order over records this can
be achieved with a topic that has only one partition, though this will mean
only one consumer process per consumer group."

(From https://kafka.apache.org/documentation/)

On Wed, 28 Nov 2018, 13:55 Boris Tyukin wrote:

> Hi guys,
>
> I am trying to come up with a good design for the following challenge:
>
> 1. ConsumeKafka processor consumes messages from 200 topics.
>
> 2. The next processor is a custom groovy processor that does some data
> transformation and also puts transformed data into target system.* It is
> crucial to process messages in topics in order. *
>
> 3. This is the tricky part - if any error is raised during step 2, I want
> to stop processing any new flowfiles for *that topic *while continue
> processing messages for other topics.
>
> In other words, I want to enforce order and process only one message per
> topic at the time and stop processing messages for a given topic, if
> currently processed message failed. It is kind like FIFO queue, that stops
> pushing items out of the queue if current item errors out.
>
> Is it possible to do it? I am open to use an external queue or cache like
> Redis.
>
> I've played a bit with EnforceOrder and Notify/Wait processors but still
> cannot wrap my head about it.
>
> Appreciate your help,
> Boris
>


NiFi JSON enrichment

2018-12-17 Thread James Srinivasan
Hi all,

I'm trying to enrich a data stream using NiFi. So far I have the following:

1) Stream of vehicle data in JSON format containing (id, make, model)
2) This vehicle data goes into HBase, using id as the row key and the
json data as the cell value (cf:json)
3) Stream of position data in JSON format, containing (id, lat, lon)
4) I extract the id from each of these items, then use FetchHBaseRow
to populate the hbase.row attribute with the json content
corresponding to that vehicle
5) I want to merge the NiFi attribute (which is actually JSON) into
the rest of the content, so I end up with (id, lat, lon, make, model).
This is where I am stuck - using the Jolt processor, I keep getting
"unable to unmarshal json to an object".

Caveats

1) I'm on NiFi 1.3
2) Much as I would like to use the new record functionality, I'm
trying to be schema agnostic as much as possible

Is this the right approach? Is there an easy way to add the attribute
value as a valid JSON object? Maybe ReplaceText capturing the trailing
} would work?

Thanks in advance,

James


Re: NiFi JSON enrichment

2018-12-18 Thread James Srinivasan
Good idea - this script is pretty close:

https://community.hortonworks.com/questions/75523/processor-for-replacing-json-values-dynamically-an.html
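
For the record, the sort of thing I have in mind is an ExecuteScript
(Groovy) body along these lines - the hbase.row attribute name, and the
assumption that it holds a plain JSON object for the vehicle, come from my
earlier description:

  import groovy.json.JsonOutput
  import groovy.json.JsonSlurper
  import org.apache.nifi.processor.io.StreamCallback

  def flowFile = session.get()
  if (flowFile == null) return

  def slurper = new JsonSlurper()
  // vehicle JSON previously looked up from HBase and stored in an attribute
  def vehicle = slurper.parseText(flowFile.getAttribute('hbase.row') ?: '{}')

  flowFile = session.write(flowFile, { inputStream, outputStream ->
      // merge the attribute JSON into the flow file's JSON content
      def position = slurper.parseText(inputStream.getText('UTF-8'))
      position.putAll(vehicle)
      outputStream.write(JsonOutput.toJson(position).getBytes('UTF-8'))
  } as StreamCallback)

  session.transfer(flowFile, REL_SUCCESS)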

Thanks
On Mon, 17 Dec 2018 at 18:01, Andrew Grande  wrote:
>
> James,
>
> The easiest would be to merge json in a custom processor. Not easy as in no 
> work at all, but given your limitations with the NiFi version could be done 
> sooner maybe.
>
> Andrew
>
> On Mon, Dec 17, 2018, 9:53 AM James Srinivasan  
> wrote:
>>
>> Hi all,
>>
>> I'm trying to enrich a data stream using NiFi. So far I have the following:
>>
>> 1) Stream of vehicle data in JSON format containing (id, make, model)
>> 2) This vehicle data goes into HBase, using id as the row key and the
>> json data as the cell value (cf:json)
>> 3) Stream of position data in JSON format, containing (id, lat, lon)
>> 4) I extract the id from each of these items, then use FetchHBaseRow
>> to populate the hbase.row attribute with the json content
>> corresponding to that vehicle
>> 5) I want to merge the NiFI attribute (which is actually JSON) into
>> the rest of the content, so I end up with (id, lat, lon, make, model).
>> This is where I am stuck - using the Jolt processor, I keep getting
>> unable to unmarshal json to an object
>>
>> Caveats
>>
>> 1) I'm on NiFi 1.3
>> 2) Much as I would like to use the new record functionality, I'm
>> trying to be schema agnostic as much as possible
>>
>> Is this the right approach? Is there an easy way to add the attribute
>> value as a valid JSON object? Maybe ReplaceText capturing the trailing
>> } would work?
>>
>> Thanks in advance,
>>
>> James


Re: NiFi JSON enrichment

2018-12-18 Thread James Srinivasan
Yup, my example used made-up fields to keep it simple. In reality I
have between 20 and 80 fields per schema, with some nesting, arrays
etc.

It might be useful if I explained what I'm currently doing and why I'm
not using the record approach:

I've got c.20 different data streams in protobuf [1] format, which I
need to put into GeoMesa [2] and ElasticSearch. The best format for
the latter two is JSON.
I've written my own processor [3] to convert from protobuf to the
canonical protobuf encoding as JSON. Unlike Avro, protobuf data does
not contain the schema, so this processor requires the .proto schema.
Finally, I've written GeoMesa converters from JSON into what is
required by GeoMesa (ElasticSearch just works). These converters are
actually automatically generated from the .proto schema.
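
(For the curious, the canonical JSON encoding is what protobuf-java-util's
JsonFormat produces, so a rough Groovy sketch of that conversion looks like
the following - file names and message name are made up, and it assumes the
.proto has no imports:)

  import com.google.protobuf.DescriptorProtos
  import com.google.protobuf.Descriptors
  import com.google.protobuf.DynamicMessage
  import com.google.protobuf.util.JsonFormat

  // descriptor set generated offline with:
  //   protoc --descriptor_set_out=vehicle.desc vehicle.proto
  def descSet = DescriptorProtos.FileDescriptorSet.parseFrom(new File('vehicle.desc').bytes)
  def fileDesc = Descriptors.FileDescriptor.buildFrom(descSet.getFile(0),
          new Descriptors.FileDescriptor[0])
  def msgDesc = fileDesc.findMessageTypeByName('Vehicle')

  // a single protobuf-encoded record (in the processor this is the flow file content)
  def rawBytes = new File('vehicle.bin').bytes
  def json = JsonFormat.printer().print(DynamicMessage.parseFrom(msgDesc, rawBytes))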

So far, so good. For enrichment, I can pull out various JSON elements,
look them up in HBase and (thanks to Andrew/Matt) merge the results
back into the outgoing JSON.

The record approach would allow me to do all the above, but in a more
strongly typed way, and would probably be more performant. However,
I'd have to write and maintain a NiFi Record Schema (Avro schema?) for
each of the .proto schemas (assuming that is possible) which seemed an
overhead for little potential gain.

I could instead convert protobuf to Avro at the start, but that seemed
non-trivial. I guess the main underlying question is how NiFi Record
Schemas are meant to work when the source data already has its own
schema definition language (right now this is only Avro?)

Hope this makes some sense, I'm certainly not against using Records
given more time & effort.

James

[1] https://developers.google.com/protocol-buffers/
[2] https://geomesa.org
[3] Which I hope to contribute back
On Mon, 17 Dec 2018 at 18:24, Bryan Bende  wrote:
>
> I know you mentioned staying schema agnostic, but if you went with the
> record approach then this sounds like a good fit for the HBase lookup
> service.
>
> Steps 3-5 would be using LookupRecord with an HBaseLookupService where
> you lookup by row id, and put the results into the current record.
>
> I'm not sure if your example used made up fields, but if not, then
> you'd just need a schema that had the 5 fields defined.
> On Mon, Dec 17, 2018 at 1:01 PM Andrew Grande  wrote:
> >
> > James,
> >
> > The easiest would be to merge json in a custom processor. Not easy as in no 
> > work at all, but given your limitations with the NiFi version could be done 
> > sooner maybe.
> >
> > Andrew
> >
> > On Mon, Dec 17, 2018, 9:53 AM James Srinivasan  
> > wrote:
> >>
> >> Hi all,
> >>
> >> I'm trying to enrich a data stream using NiFi. So far I have the following:
> >>
> >> 1) Stream of vehicle data in JSON format containing (id, make, model)
> >> 2) This vehicle data goes into HBase, using id as the row key and the
> >> json data as the cell value (cf:json)
> >> 3) Stream of position data in JSON format, containing (id, lat, lon)
> >> 4) I extract the id from each of these items, then use FetchHBaseRow
> >> to populate the hbase.row attribute with the json content
> >> corresponding to that vehicle
> >> 5) I want to merge the NiFI attribute (which is actually JSON) into
> >> the rest of the content, so I end up with (id, lat, lon, make, model).
> >> This is where I am stuck - using the Jolt processor, I keep getting
> >> unable to unmarshal json to an object
> >>
> >> Caveats
> >>
> >> 1) I'm on NiFi 1.3
> >> 2) Much as I would like to use the new record functionality, I'm
> >> trying to be schema agnostic as much as possible
> >>
> >> Is this the right approach? Is there an easy way to add the attribute
> >> value as a valid JSON object? Maybe ReplaceText capturing the trailing
> >> } would work?
> >>
> >> Thanks in advance,
> >>
> >> James


Re: NiFi JSON enrichment

2019-01-03 Thread James Srinivasan
Hi Austin,

We did consider enriching records in the GeoMesa converter, but we
also need the enriched records for other destinations e.g.
ElasticSearch, hence we were keen to keep it in NiFi. As suggested, I
think I'll write a little Groovy script to merge JSON (looked up from
HBase) from NiFi attributes into the JSON flow file content.

Thanks,

James

On Thu, 27 Dec 2018 at 22:01, Austin Heyne  wrote:
>
> James,
>
> A little late to the show but hopefully this is useful.
>
> What we typically do for data enrichment is we'll use an EvaluateJsonPath 
> processor to pull JSON fields out into attributes under a common key, e.g. 
> foo.model. We then have a PutRedis processor that grabs everything under foo 
> and adds them to a record keyed under an identifier for that feature. We then 
> use a redis enrichment cache in the converter [1] to fill in our missing 
> fields.
>
> For your use case I'd probably stream the vehicle data into a redis cache and 
> then have the position updates hit that cache when they're actually being 
> converted.
>
> -Austin
>
> [1] https://www.geomesa.org/documentation/user/convert/cache.html#
>
>
> On 12/18/2018 07:54 PM, Mike Thomsen wrote:
>
> James,
>
> Only skimmed this, but this looks like it might provide some interesting 
> ideas on how to transition from Protobuf to Avro:
>
> https://gist.github.com/alexvictoor/1d3937f502c60318071f
>
> Mike
>
> On Tue, Dec 18, 2018 at 3:07 PM Otto Fowler  wrote:
>>
>> What would be really cool would be if you could also load the registry with 
>> your .protos somehow, and configure using the proto names, and then just 
>> have your registry convert them on demand
>>
>>
>> On December 18, 2018 at 15:04:30, Otto Fowler (ottobackwa...@gmail.com) 
>> wrote:
>>
>> You could implement a custom schema registry that converts the protos to 
>> schema on the fly and caches.
>>
>>
>> On December 18, 2018 at 13:55:47, James Srinivasan 
>> (james.sriniva...@gmail.com) wrote:
>>
>> Yup, my example used made-up fields to keep it simple. In reality I
>> have between 20 and 80 fields per schema, with some nesting, arrays
>> etc.
>>
>> It might be useful if I explained what I'm currently doing and why I'm
>> not using the record approach:
>>
>> I've got c.20 different data streams in protobuf [1] format, which I
>> need to put into GeoMesa [2] and ElasticSearch. The best format for
>> the latter two is JSON.
>> I've written my own processor [3] to convert from protobuf to the
>> canonical protobuf encoding as JSON. Unlike Avro, protobuf data does
>> not contain the schema, so this processor requires the .proto schema.
>> Finally, I've written GeoMesa converters from JSON into what is
>> required by GeoMesa (ElasticSearch just works). These converters are
>> actually automatically generated from the .proto schema.
>>
>> So far, so good. For enrichment, I can pull out various JSON elements,
>> look them up in HBase and (thanks to Andrew/Matt) merge the results
>> back into the outgoing JSON.
>>
>> The record approach would allow me to do all the above, but in a more
>> strongly typed way, and would probably be more performant. However,
>> I'd have to write and maintain a NiFi Record Schema (Avro schema?) for
>> each of the .proto schemas (assuming that is possible) which seemed an
>> overhead for little potential gain.
>>
>> I could instead convert protobuf to Avro at the start, but that seemed
>> non-trivial. I guess the main underlying question is how NiFi Record
>> Schemas are meant to work when the source data already has its own
>> schema definition language (right now this is only Avro?)
>>
>> Hope this makes some sense, I'm certainly not against using Records
>> given more time & effort.
>>
>> James
>>
>> [1] https://developers.google.com/protocol-buffers/
>> [2] https://geomesa.org
>> [3] Which I hope to contribute back
>> On Mon, 17 Dec 2018 at 18:24, Bryan Bende  wrote:
>> >
>> > I know you mentioned staying schema agnostic, but if you went with the
>> > record approach then this sounds like a good fit for the HBase lookup
>> > service.
>> >
>> > Steps 3-5 would be using LookupRecord with an HBaseLookupService where
>> > you lookup by row id, and put the results into the current record.
>> >
>> > I'm not sure if your example used made up fields, but if not, then
>> > you'd just need a schema that had the 5 fields defined.
>> > On Mon, Dec 17, 2018 a

Re: expression failure in URL concatenation?

2019-01-31 Thread James Srinivasan
Out of interest, does the URL containing the double slash work?

On Thu, 31 Jan 2019, 21:42 l vic wrote:

> I am using processor group variable as base part of my URL:
> REST_URL=http://localhost:8080/nifi-api
>
> I am trying to append the second part of the URL in InvokeHTTP regardless of
> whether REST_URL ends with '/' or not, so that the concatenation of
> "http://localhost:8080/nifi-api/" or "http://localhost:8080/nifi-api"
> with "resources" returns the same URL:
> "http://localhost:8080/nifi-api/resources":
>
>
> ${${REST_URL}:endsWith('/'):ifElse('${REST_URL}resources','${REST_URL}/resources')}
>
>
> But the expression always appends '/', so in the first case I end up with
> "http://localhost:8080/nifi-api//resources"... Any idea where the error
> is?
>
> Thank you,
>
>
>


Re: Is the DistributedMapCacheService a single point of failure?

2019-02-12 Thread James Srinivasan
We switched to HBase_1_1_2_ClientMapCacheService for precisely this
reason. It works great (we already had HBase which probably helped)
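
(From memory, the service setup looks roughly like the following - the
table, family and qualifier are our choices rather than anything special,
we created the table in HBase up front, and exact property names may differ
slightly between versions:

  HBase Client Service    -> our existing HBase_1_1_2_ClientService
  HBase Cache Table Name  -> nifi:cache
  HBase Column Family     -> f
  HBase Column Qualifier  -> q

Everything that previously pointed at the DistributedMapCacheClientService
just points at this service instead.)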

On Tue, 12 Feb 2019 at 12:51, Vos, Walter  wrote:
>
> Hi,
>
> I'm on NiFi 1.5 and we're currently having an issue with one of the nodes in 
> our three node cluster. No biggie, just disconnect it from the cluster and 
> let the other two nodes run things for a while, right? Unfortunately, some of 
> our flows are using a DistributedMapCacheService that have that particular 
> node that we took out set as the server hostname. For me as an admin, this is 
> worrying :-)
>
> Is there anything I can do in terms of configuration to "clusterize" the 
> DistributedMapCacheServices? I can already see that the 
> DistributedMapCacheServer doesn't define a hostname, so I guess that runs on 
> all nodes. Can we set multiple hostnames in the DistributedMapCacheService 
> then? Or should I just change it over in case of node failure? Is the cache 
> shared among the cluster? I.e. do all nodes have the same values for each 
> signal identifier/counter name?
>
> Kind regards,
>
> Walter
>


Re: Is the DistributedMapCacheService a single point of failure?

2019-02-12 Thread James Srinivasan
(you can make it slightly less yucky by persisting the cache to shared
storage so you don't lose the contents when another node starts up,
but you do have to manually poke the clients)
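
(On the server side that's just the Persistence Directory property on the
DistributedMapCacheServer pointing at the shared storage, e.g.
/some/shared/path - the path is made up.)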

On Tue, 12 Feb 2019 at 14:06, Bryan Bende  wrote:
>
> As James pointed out, there are alternate implementations of the DMC
> client that use external services that can be configured for high
> availability, such as HBase or Redis.
>
> When using the DMC client service, which is meant to work with the DMC
> server, the server is a single point of failure. In a cluster, the
> server runs on all nodes, but it doesn't replicate data between them,
> and the client can only point at one of these nodes. If you have to
> switch the client to point at a new server, then the cache will be
> starting over on the new server.
>
> On Tue, Feb 12, 2019 at 8:11 AM James Srinivasan
>  wrote:
> >
> > We switched to HBase_1_1_2_ClientMapCacheService for precisely this
> > reason. It works great (we already had HBase which probably helped)
> >
> > On Tue, 12 Feb 2019 at 12:51, Vos, Walter  wrote:
> > >
> > > Hi,
> > >
> > > I'm on NiFi 1.5 and we're currently having an issue with one of the nodes 
> > > in our three node cluster. No biggie, just disconnect it from the cluster 
> > > and let the other two nodes run things for a while, right? Unfortunately, 
> > > some of our flows are using a DistributedMapCacheService that have that 
> > > particular node that we took out set as the server hostname. For me as an 
> > > admin, this is worrying :-)
> > >
> > > Is there anything I can do in terms of configuration to "clusterize" the 
> > > DistributedMapCacheServices? I can already see that the 
> > > DistributedMapCacheServer doesn't define a hostname, so I guess that runs 
> > > on all nodes. Can we set multiple hostnames in the 
> > > DistributedMapCacheService then? Or should I just change it over in case 
> > > of node failure? Is the cache shared among the cluster? I.e. do all nodes 
> > > have the same values for each signal identifier/counter name?
> > >
> > > Kind regards,
> > >
> > > Walter
> > >


Re: Tailor logback.xml custom Appender and Logging

2019-02-25 Thread James Srinivasan
I suggested something similar here:

http://apache-nifi-users-list.2361937.n4.nabble.com/Demuxing-NiFi-logs-td5689.html
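
For reference, the per-type redirection in that kind of approach boils down
to an appender/logger pair in conf/logback.xml along these lines (file
names and pattern are made up, and note the logger is per processor class,
not per instance or per process group):

  <appender name="LOG_ATTRIBUTE_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>logs/log-attribute.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <fileNamePattern>logs/log-attribute_%d.log</fileNamePattern>
      <maxHistory>10</maxHistory>
    </rollingPolicy>
    <encoder class="ch.qos.logback.classic.encoder.PatternLayoutEncoder">
      <pattern>%date %level [%thread] %logger{40} %msg%n</pattern>
    </encoder>
  </appender>

  <logger name="org.apache.nifi.processors.standard.LogAttribute" level="INFO" additivity="false">
    <appender-ref ref="LOG_ATTRIBUTE_FILE"/>
  </logger>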

On Mon, 25 Feb 2019 at 12:11, James McMahon  wrote:
>
> A very helpful approach to redirecting logging for a particular processor 
> type is presented here: 
> https://community.hortonworks.com/questions/63071/in-apache-ni-fi-how-can-i-log-all-the-flowfile-att.html
>  . It shows how we can tailor logback.xml to redirect logging for a given 
> processor.
>
> But this appears to redirect logging for every single instance of that 
> processor type globally across the nifi service. This would not be ideal: for 
> example, clearly others employing LogAttribute processors in their process 
> groups for their flows may not want their logging redirected to my chosen log 
> file for my LogAttribute instance.
>
> Has anyone figured out how to use the approach described by the link above - 
> namely, add a custom Appender and Logger in logback.xml - to tailor the 
> customization to a specific processor instance? Or perhaps even limit it to a 
> particular Process Group?
>
> Thank you.


Re: Tailor logback.xml custom Appender and Logging

2019-02-25 Thread James Srinivasan
I created a JIRA to log this here:

https://issues.apache.org/jira/browse/NIFI-6079

Please do comment - not sure my suggested approach is best

On Mon, 25 Feb 2019 at 20:19, Andy LoPresto  wrote:
>
> Hate to be “that guy”, but the simplest thing I can think of right now is set 
> up a NiFi flow that tails nifi-app.log, looks for messages with the specific 
> component ID, and puts them into another log (either MergeContent & PutFile, 
> or ExecuteScript / ExecuteStreamCommand to just echo them out). I do not 
> believe you can identify a specific instance of a processor via component ID 
> in the logback settings.
>
> Andy LoPresto
> alopre...@apache.org
> alopresto.apa...@gmail.com
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>
> On Feb 25, 2019, at 4:10 AM, James McMahon  wrote:
>
> A very helpful approach to redirecting logging for a particular processor 
> type is presented here: 
> https://community.hortonworks.com/questions/63071/in-apache-ni-fi-how-can-i-log-all-the-flowfile-att.html
>  . It shows how we can tailor logback.xml to redirect logging for a given 
> processor.
>
> But this appears to redirect logging for every single instance of that 
> processor type globally across the nifi service. This would not be ideal: for 
> example, clearly others employing LogAttribute processors in their process 
> groups for their flows may not want their logging redirected to my chosen log 
> file for my LogAttribute instance.
>
> Has anyone figured out how to use the approach described by the link above - 
> namely, add a custom Appender and Logger in logback.xml - to tailor the 
> customization to a specific processor instance? Or perhaps even limit it to a 
> particular Process Group?
>
> Thank you.
>
>


Re: [EXTERNAL] Re: PublishKafka_1_0 Kerberos

2019-02-26 Thread James Srinivasan
If someone is updating the Kafka docs, it would be great to get this
corrected:

http://apache-nifi-users-list.2361937.n4.nabble.com/Incorrect-PublishKafka-0-10-documentation-td3406.html

(Been on my to-do list for ages)

On Tue, 26 Feb 2019, 16:16 Bryan Bende,  wrote:

> Hi Dan,
>
> There is a note in there somewhere that says it doesn't have to be
> specified in the JAAS file...
>
> "Alternatively, the JAAS configuration when using GSSAPI can be
> provided by specifying the Kerberos Principal and Kerberos Keytab
> directly in the processor properties. This will dynamically create a
> JAAS configuration like above, and will take precedence over the
> java.security.auth.login.config system property. "
>
> We could also update that to mention that the Keytab Credential
> Service can take precedence over the principal and keytab directly in
> the processor.
>
> Thanks,
>
> Bryan
>
> On Tue, Feb 26, 2019 at 11:13 AM Dan Caulfield 
> wrote:
> >
> > The additional details link within the processor usage docs still
> references using the JAAS file.
> >
> >
> >
> > From: Jeremy Dyer [mailto:jdy...@gmail.com]
> > Sent: Tuesday, February 26, 2019 10:00 AM
> > To: users@nifi.apache.org
> > Subject: [EXTERNAL] Re: PublishKafka_1_0 Kerberos
> >
> >
> >
> > A JAAS file is not required. Can you point me to the documentation that
> is confusing so I can change that as well
> >
> >
> >
> > On Tue, Feb 26, 2019 at 7:57 AM Dan Caulfield 
> wrote:
> >
> > Is a JAAS file still required when using SASL_PLAINTEXT or SASL_SSL?
> Since the Kerberos Credential Service and the Processor itself has all of
> the information that the JAAS file would contain it seems to contradict the
> documentation.
> >
> >
> >
> >
> > Dan Caulfield
> > Hyper Scale Engineer
> >
> > GM – NA Information Technology – Hyper Scale Data Solutions
> >
> > dan.caulfi...@gm.com
> > T +1 5128403497
> >
> >
> >
> >
> >
> > Nothing in this message is intended to constitute an electronic
> signature unless a specific statement to the contrary is included in this
> message.
> >
> > Confidentiality Note: This message is intended only for the person or
> entity to which it is addressed. It may contain confidential and/or
> privileged material. Any review, transmission, dissemination or other use,
> or taking of any action in reliance upon this message by persons or
> entities other than the intended recipient is prohibited and may be
> unlawful. If you received this message in error, please contact the sender
> and delete it from your computer.
> >
> >
> >
> > Nothing in this message is intended to constitute an electronic
> signature unless a specific statement to the contrary is included in this
> message.
> >
> > Confidentiality Note: This message is intended only for the person or
> entity to which it is addressed. It may contain confidential and/or
> privileged material. Any review, transmission, dissemination or other use,
> or taking of any action in reliance upon this message by persons or
> entities other than the intended recipient is prohibited and may be
> unlawful. If you received this message in error, please contact the sender
> and delete it from your computer.
>


Re: Different NiFi Node sizes within same cluster

2019-03-06 Thread James Srinivasan
Yes, we hit this with the new load balanced queues (which, to be fair, we
also had with remote process groups previously). Two "old" nodes got
saturated and their queues filled while three "new" nodes were fine.

My "solution" was to move everything to new hardware which we had inbound
anyway.

On Wed, 6 Mar 2019, 20:40 Jon Logan,  wrote:

> You may run into issues with different processing power, as some machines
> may be overwhelmed in order to saturate other machines.
>
> On Wed, Mar 6, 2019 at 3:34 PM Mark Payne  wrote:
>
>> Chad,
>>
>> This should not be a problem, given that all nodes have enough storage
>> available to handle the influx of data.
>>
>> Thanks
>> -Mark
>>
>>
>> > On Mar 6, 2019, at 1:44 PM, Chad Woodhead 
>> wrote:
>> >
>> > Are there any negative effects of having filesystem mounts (dedicated
>> mounts for each repo) used by the different NiFi repositories differ in
>> size on NiFi nodes within the same cluster? For instance, if some nodes
>> have a content_repo mount of 130 GB and other nodes have a content_repo
>> mount of 125 GB, could that cause any problems or cause one node to be used
>> more since it has more space? What about if the difference was larger, by
>> say a 100 GB difference?
>> >
>> > Trying to repurpose old nodes and add them as NiFi nodes, but their
>> mount sizes are different than my current cluster’s nodes and I’ve noticed
>> I can’t set the max size limit to use of a particular mount for a repo.
>> >
>> > -Chad
>>
>>


Re: Different NiFi Node sizes within same cluster

2019-03-06 Thread James Srinivasan
In our case, backpressure applied all the way up to the TCP network
source which meant we lost data. AIUI, current load balancing is round
robin (and two other options prob not relevant). Would actual load
balancing (e.g. send to node with lowest OS load, or number of active
threads) be a reasonable request?

On Wed, 6 Mar 2019 at 20:51, Joe Witt  wrote:
>
> This is generally workable (heterogenous node capabilities) in NiFi 
> clustering.  But you do want to leverage back-pressure and load balanced 
> connections so that faster nodes will have an opportunity to take on the 
> workload for slower nodes.
>
> Thanks
>
> On Wed, Mar 6, 2019 at 3:48 PM James Srinivasan  
> wrote:
>>
>> Yes, we hit this with the new load balanced queues (which, to be fair, we 
>> also had with remote process groups previously). Two "old" nodes got 
>> saturated and their queues filled while three "new" nodes were fine.
>>
>> My "solution" was to move everything to new hardware which we had inbound 
>> anyway.
>>
>> On Wed, 6 Mar 2019, 20:40 Jon Logan,  wrote:
>>>
>>> You may run into issues with different processing power, as some machines 
>>> may be overwhelmed in order to saturate other machines.
>>>
>>> On Wed, Mar 6, 2019 at 3:34 PM Mark Payne  wrote:
>>>>
>>>> Chad,
>>>>
>>>> This should not be a problem, given that all nodes have enough storage 
>>>> available to handle the influx of data.
>>>>
>>>> Thanks
>>>> -Mark
>>>>
>>>>
>>>> > On Mar 6, 2019, at 1:44 PM, Chad Woodhead  wrote:
>>>> >
>>>> > Are there any negative effects of having filesystem mounts (dedicated 
>>>> > mounts for each repo) used by the different NiFi repositories differ in 
>>>> > size on NiFi nodes within the same cluster? For instance, if some nodes 
>>>> > have a content_repo mount of 130 GB and other nodes have a content_repo 
>>>> > mount of 125 GB, could that cause any problems or cause one node to be 
>>>> > used more since it has more space? What about if the difference was 
>>>> > larger, by say a 100 GB difference?
>>>> >
>>>> > Trying to repurpose old nodes and add them as NiFi nodes, but their 
>>>> > mount sizes are different than my current cluster’s nodes and I’ve 
>>>> > noticed I can’t set the max size limit to use of a particular mount for 
>>>> > a repo.
>>>> >
>>>> > -Chad
>>>>


Re: Different NiFi Node sizes within same cluster

2019-03-06 Thread James Srinivasan
Yup, but because of the unfortunate way the source (outside NiFi)
works, it doesn't buffer for long when the connection doesn't pull or
drops. It behaves far more like a 5 Mbps UDP stream really :-(

On Wed, 6 Mar 2019 at 21:44, Bryan Bende  wrote:
>
> James, just curious, what was your source processor in this case? ListenTCP?
>
> On Wed, Mar 6, 2019 at 4:26 PM Jon Logan  wrote:
> >
> > What really would resolve some of these issues is backpressure on CPU -- 
> > ie. let Nifi throttle itself down to not choke the machine until it dies if 
> > constrained on CPU. Easier said than done unfortunately.
> >
> > On Wed, Mar 6, 2019 at 4:23 PM James Srinivasan 
> >  wrote:
> >>
> >> In our case, backpressure applied all the way up to the TCP network
> >> source which meant we lost data. AIUI, current load balancing is round
> >> robin (and two other options prob not relevant). Would actual load
> >> balancing (e.g. send to node with lowest OS load, or number of active
> >> threads) be a reasonable request?
> >>
> >> On Wed, 6 Mar 2019 at 20:51, Joe Witt  wrote:
> >> >
> >> > This is generally workable (heterogenous node capabilities) in NiFi 
> >> > clustering.  But you do want to leverage back-pressure and load balanced 
> >> > connections so that faster nodes will have an opportunity to take on the 
> >> > workload for slower nodes.
> >> >
> >> > Thanks
> >> >
> >> > On Wed, Mar 6, 2019 at 3:48 PM James Srinivasan 
> >> >  wrote:
> >> >>
> >> >> Yes, we hit this with the new load balanced queues (which, to be fair, 
> >> >> we also had with remote process groups previously). Two "old" nodes got 
> >> >> saturated and their queues filled while three "new" nodes were fine.
> >> >>
> >> >> My "solution" was to move everything to new hardware which we had 
> >> >> inbound anyway.
> >> >>
> >> >> On Wed, 6 Mar 2019, 20:40 Jon Logan,  wrote:
> >> >>>
> >> >>> You may run into issues with different processing power, as some 
> >> >>> machines may be overwhelmed in order to saturate other machines.
> >> >>>
> >> >>> On Wed, Mar 6, 2019 at 3:34 PM Mark Payne  wrote:
> >> >>>>
> >> >>>> Chad,
> >> >>>>
> >> >>>> This should not be a problem, given that all nodes have enough 
> >> >>>> storage available to handle the influx of data.
> >> >>>>
> >> >>>> Thanks
> >> >>>> -Mark
> >> >>>>
> >> >>>>
> >> >>>> > On Mar 6, 2019, at 1:44 PM, Chad Woodhead  
> >> >>>> > wrote:
> >> >>>> >
> >> >>>> > Are there any negative effects of having filesystem mounts 
> >> >>>> > (dedicated mounts for each repo) used by the different NiFi 
> >> >>>> > repositories differ in size on NiFi nodes within the same cluster? 
> >> >>>> > For instance, if some nodes have a content_repo mount of 130 GB and 
> >> >>>> > other nodes have a content_repo mount of 125 GB, could that cause 
> >> >>>> > any problems or cause one node to be used more since it has more 
> >> >>>> > space? What about if the difference was larger, by say a 100 GB 
> >> >>>> > difference?
> >> >>>> >
> >> >>>> > Trying to repurpose old nodes and add them as NiFi nodes, but their 
> >> >>>> > mount sizes are different than my current cluster’s nodes and I’ve 
> >> >>>> > noticed I can’t set the max size limit to use of a particular mount 
> >> >>>> > for a repo.
> >> >>>> >
> >> >>>> > -Chad
> >> >>>>


Re: Connecting to a kerberized HBase 2 instance

2019-03-15 Thread James Srinivasan
That combo works for me. AIUI, the data API is compatible but the
management API (table creation etc) isn't.

On Fri, 15 Mar 2019, 17:29 Mike Thomsen,  wrote:

> Can the 1.1.2 client connect to a kerberized HBase 2 instance? We're stuck
> on NiFi 1.8.0 for now and someone upgraded us to HBase 2.0 for our new VMs.
> I can backport the NAR, but wanted to check. Getting a lot of
> ConnectionLoss errors from ZooKeeper looking for /hbase-secure and other
> things from the new installation and am trying to limit the range of
> possible problems here.
>
> Thanks,
>
> Mike
>


Re: Connecting to a kerberized HBase 2 instance

2019-03-15 Thread James Srinivasan
For us, we did a major HDP 2.6 -> 3.0 upgrade. This upgraded HBase to
2.x, and I was pleasantly surprised to find our NiFi flows (using
HBase_1_1_2_ClientService) just worked with their previous settings.

On Fri, 15 Mar 2019 at 19:48, Mike Thomsen  wrote:
>
> After some fighting, we got straight down to an issue with it throwing 
> MasterNotRunningException (or whatever it's called), indicating it couldn't 
> find the master on the configured host and port. So at this point, I assume 
> it's a security group issue.
>
> Thanks,
>
> Mike
>
> On Fri, Mar 15, 2019 at 1:44 PM Bryan Bende  wrote:
>>
>> Yes, James's statement is what I believe to be true as well.
>>
>> On Fri, Mar 15, 2019 at 1:33 PM James Srinivasan
>>  wrote:
>> >
>> > That combo works for me. AIUI, the data API is compatible but the 
>> > management API (table creation etc) isn't.
>> >
>> > On Fri, 15 Mar 2019, 17:29 Mike Thomsen,  wrote:
>> >>
>> >> Can the 1.1.2 client connect to a kerberized HBase 2 instance? We're 
>> >> stuck on NiFi 1.8.0 for now and someone upgraded us to HBase 2.0 for our 
>> >> new VMs. I can backport the NAR, but wanted to check. Getting a lot of 
>> >> ConnectionLoss errors from ZooKeeper looking for /hbase-secure and other 
>> >> things from the new installation and am trying to limit the range of 
>> >> possible problems here.
>> >>
>> >> Thanks,
>> >>
>> >> Mike


Re: ListenUDP: internal queue at maximum capacity, could not queue event

2019-06-05 Thread James Srinivasan
Presumably you'd want to mirror the stream to all nodes for when the
primary node changes?

On Wed, 5 Jun 2019, 13:46 Bryan Bende,  wrote:

> The processor is started on all nodes, but onTrigger method is only
> executed on the primary node.
>
> This is something we've discussed trying to improve before, but the
> real question is why are you sending data to the other nodes if you
> don't expect the processor to execute there?
>
> On Wed, Jun 5, 2019 at 7:04 AM Erik-Jan  wrote:
> >
> > I figured it out after further testing. The processor runs on all nodes,
> despite the explicit "run on primary node only" option that I selected. But
> only on the primary node the queue is processed. On the other nodes the
> queue gets filled until the max is reached after which the error message
> starts appearing. What I missed before is that the message is coming from
> the other, non-primary nodes.
> > I'm not sure if this is intended behavior or if it is a bug though! For
> me it's a bug since I really want this processor to run on the primary only.
> >
> >> On Tue, 4 Jun 2019, 16:34 Erik-Jan wrote:
> >>
> >> Hi Bryan,
> >>
> >> Yes I have considerably increased the numbers in the controller
> settings.
> >> I don't mind getting my hands dirty, increasing the timeout is worth a
> try.
> >>
> >> The errors seems to appear after quite a while. Usually I see these
> messages the next morning so testing and experimenting with this error
> takes a lot of time.
> >>
> >> Today I've been trying to reproduce this on a virtual machine with the
> same OS, Nifi and Java versions but to no avail. The difference is that
> this VM is not a cluster, has limited memory and cpu and still is able to
> handle much more UDP data with the error appearing only a few times so far
> after hours of running. It leads me to thinking there must be something in
> the configuration of the cluster thats causing this. I will also try a
> vanilla Nifi install on one of the nodes without clustering to see if my
> configuration and cluster setup is somehow the cause.
> >>
> >> On Tue, 4 Jun 2019 at 16:14 Bryan Bende wrote:
> >>>
> >>> Hi Erik,
> >>>
> >>> It sounds like you have tried most of the common tuning options that
> >>> can be done. I would have expected batching + increasing concurrent
> >>> tasks from 1 to 3-5 to be the biggest improvement.
> >>>
> >>> Have you increased the number of threads in your overall thread pool
> >>> according to your hardware? (from the top right menu controller
> >>> settings)
> >>>
> >>> I would be curious what happens if you did some tests increasing the
> >>> timeout where it attempts to place the message in the queue from 100ms
> >>> to 200ms and then maybe 500ms if it still happens.
> >>>
> >>> I know this requires a code change since that timeout is hard-coded,
> >>> but it sounds like you already went down that path with trying a
> >>> different queue :)
> >>>
> >>> -Bryan
> >>>
> >>> On Tue, Jun 4, 2019 at 4:28 AM Erik-Jan  wrote:
> >>> >
> >>> > Hi,
> >>> >
> >>> > I'm experimenting with a locally installed 3 node nifi cluster. This
> cluster receives UDP packets on the primary node.
> >>> > These nodes are pretty powerful, have a good network connection,
> have lots of memory and SSD disks. I gave nifi 24G of java heap (xms and
> xmx).
> >>> >
> >>> > I have configured a ListenUDP processor that listens on a UDP port
> and it receives somewhere between 2 to 5 packets per 5 minutes.
> It's "Max size of message queue" is large enough (1M), I gave it 5
> concurrent tasks, it's running on the primary node only.
> >>> >
> >>> > The problem: after running for a while, I get the following error:
> "internal queue at maximum capacity, could not queue event."
> >>> >
> >>> > I have reviewed the source code and understand when this happens. It
> happens when the processor tries to store an event in a java
> LinkedBlockingQueue and that queue reached its maximum capacity. The
> offer() method has a 100ms timeout in which it waits for space to free up
> and then it fails and the event gets dropped. In the logs I see exactly 10
> of these error messages per second (10 x 100ms is 1 second). Despite these
> errors, I still get a very good rate of events that get through to the next
> processors. Actually, it seems pretty much all of the other events get
> through since the message rate in ListenUDP and the followup processor are
> very much alike. The followup processors can easily handle the load and
> there are no full queues, congestions or anything like that.
> >>> >
> >>> > What I have tried so far:
> >>> >
> >>> > Increasing the "Max Size of Message Queue" setting helps, but only
> delays the errors. They eventually return.
> >>> >
> >>> > Increasing heap space is a suggestion I read from a past post: I
> think 24G is more than enough actually? Perhaps even too much?
> >>> >
> >>> > Increasing parallelism: concurrent tasks set to 5 or 10 does not
> help.
> >>> >
> >>> > I modified the code to use an Arr

Re: ListenUDP: internal queue at maximum capacity, could not queue event

2019-06-05 Thread James Srinivasan
In our case the stream is UDP broadcast, so available to all nodes anyway.
I've been meaning to test UDP multicast but not got round to it yet.


On Wed, 5 Jun 2019, 17:03 Bryan Bende,  wrote:

> That is probably a valid point, but how about putting a load balancer
> in front to handle that?
>
> On Wed, Jun 5, 2019 at 11:30 AM James Srinivasan
>  wrote:
> >
> > Presumably you'd want to mirror the stream to all nodes for when the
> primary node changes?
> >
> > On Wed, 5 Jun 2019, 13:46 Bryan Bende,  wrote:
> >>
> >> The processor is started on all nodes, but onTrigger method is only
> >> executed on the primary node.
> >>
> >> This is something we've discussed trying to improve before, but the
> >> real question is why are you sending data to the other nodes if you
> >> don't expect the processor to execute there?
> >>
> >> On Wed, Jun 5, 2019 at 7:04 AM Erik-Jan  wrote:
> >> >
> >> > I figured it out after further testing. The processor runs on all
> nodes, despite the explicit "run on primary node only" option that I
> selected. But only on the primary node the queue is processed. On the other
> nodes the queue gets filled until the max is reached after which the error
> message starts appearing. What I missed before is that the message is
> coming from the other, non-primary nodes.
> >> > I'm not sure if this is intended behavior or if it is a bug though!
> For me it's a bug since I really want this processor to run on the primary
> only.
> >> >
> >> > On Tue, 4 Jun 2019, 16:34 Erik-Jan wrote:
> >> >>
> >> >> Hi Bryan,
> >> >>
> >> >> Yes I have considerably increased the numbers in the controller
> settings.
> >> >> I don't mind getting my hands dirty, increasing the timeout is worth
> a try.
> >> >>
> >> >> The errors seems to appear after quite a while. Usually I see these
> messages the next morning so testing and experimenting with this error
> takes a lot of time.
> >> >>
> >> >> Today I've been trying to reproduce this on a virtual machine with
> the same OS, Nifi and Java versions but to no avail. The difference is that
> this VM is not a cluster, has limited memory and cpu and still is able to
> handle much more UDP data with the error appearing only a few times so far
> after hours of running. It leads me to thinking there must be something in
> the configuration of the cluster thats causing this. I will also try a
> vanilla Nifi install on one of the nodes without clustering to see if my
> configuration and cluster setup is somehow the cause.
> >> >>
> >> >> On Tue, 4 Jun 2019 at 16:14 Bryan Bende wrote:
> >> >>>
> >> >>> Hi Erik,
> >> >>>
> >> >>> It sounds like you have tried most of the common tuning options that
> >> >>> can be done. I would have expected batching + increasing concurrent
> >> >>> tasks from 1 to 3-5 to be the biggest improvement.
> >> >>>
> >> >>> Have you increased the number of threads in your overall thread pool
> >> >>> according to your hardware? (from the top right menu controller
> >> >>> settings)
> >> >>>
> >> >>> I would be curious what happens if you did some tests increasing the
> >> >>> timeout where it attempts to place the message in the queue from
> 100ms
> >> >>> to 200ms and then maybe 500ms if it still happens.
> >> >>>
> >> >>> I know this requires a code change since that timeout is hard-coded,
> >> >>> but it sounds like you already went down that path with trying a
> >> >>> different queue :)
> >> >>>
> >> >>> -Bryan
> >> >>>
> >> >>> On Tue, Jun 4, 2019 at 4:28 AM Erik-Jan  wrote:
> >> >>> >
> >> >>> > Hi,
> >> >>> >
> >> >>> > I'm experimenting with a locally installed 3 node nifi cluster.
> This cluster receives UDP packets on the primary node.
> >> >>> > These nodes are pretty powerful, have a good network connection,
> have lots of memory and SSD disks. I gave nifi 24G of java heap (xms and
> xmx).
> >> >>> >
> >> >>> > I have configured a ListenUDP processor that listens on a UDP
> port and it receives somewhere between 2 to 5 pa

Kerberos Ticket Renewal (when not updating Hadoop user)

2019-06-12 Thread James Srinivasan
Hi all,

I'm finally getting around to fixing up some deprecation issues with
our use of Kerberos with Accumulo and GeoMesa
(https://github.com/locationtech/geomesa/). Because I didn't know any
better at the time, I used the KerberosToken ctor specifying that the
Hadoop user should be replaced. Combined with a thread to periodically
renew the ticket (calling
UserGroupInformation.getCurrentUser.checkTGTAndReloginFromKeytab()),
this has worked nicely for us.
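
(The renewal thread itself is nothing fancy - roughly the following, with
the interval just an example:)

  import java.util.concurrent.Executors
  import java.util.concurrent.TimeUnit
  import org.apache.hadoop.security.UserGroupInformation

  def renewer = Executors.newSingleThreadScheduledExecutor()
  renewer.scheduleAtFixedRate({
      // re-login from the keytab if the TGT is close to expiry
      UserGroupInformation.getCurrentUser().checkTGTAndReloginFromKeytab()
  } as Runnable, 1, 1, TimeUnit.HOURS)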

However, there are some unfortunate side effects of updating the
Hadoop user - for instance, subsequent HDFS operations use the new
user, who may not have the same permissions as the original user in a
Zeppelin-type notebook environment. Plus the replaceCurrentUser param
is deprecated and removed in Accumulo 2.0. So I'm keen on not
replacing the Hadoop user, but how do I handle ticket renewal?

Thanks very much,

James


Re: Kerberos Ticket Renewal (when not updating Hadoop user)

2019-06-13 Thread James Srinivasan
Err, my bad - meant to send this to the Accumulo list!

Sorry!

On Thu, 13 Jun 2019, 18:07 Jeff,  wrote:

> Hello James,
>
> For our Hadoop processors, we generally don't do any explicit relogins/TGT
> renewal.  It's handled implicitly by the Hadoop libs.  PR 2360 [1] is the
> primary change-set to allow this in NiFi, and several NiFI JIRAs (mainly
> NIFI-3472 [2]) are referenced in that pull request if you are interested in
> doing further reading.
>
> HiveConnectionPool (across various Hive versions) is the only component
> that comes to mind where we explicitly try to do a relogin.  NIFI-5134 [3]
> contains more information on that.
>
> Hope this information helps!
>
> [1] https://github.com/apache/nifi/pull/2360
> [2] https://issues.apache.org/jira/browse/NIFI-3472
> [3] https://issues.apache.org/jira/browse/NIFI-5134
>
> On Wed, Jun 12, 2019 at 4:06 PM James Srinivasan <
> james.sriniva...@gmail.com> wrote:
>
>> Hi all,
>>
>> I'm finally getting around to fixing up some deprecation issues with
>> our use of Kerberos with Accumulo and GeoMesa
>> (https://github.com/locationtech/geomesa/). Because I didn't know any
>> better at the time, I used the KerberosToken ctor specifying that the
>> Hadoop user should be replaced. Combined with a thread to periodically
>> renew the ticket (calling
>> UserGroupInformation.getCurrentUser.checkTGTAndReloginFromKeytab()),
>> this has worked nicely for us.
>>
>> However, there are some unfortunate side effects of updating the
>> Hadoop user - for instance, subsequent HDFS operations use the new
>> user, who may not have the same permissions as the original user in a
>> Zeppelin-type notebook environment. Plus the replaceCurrentUser param
>> is deprecated and removed in Accumulo 2.0. So I'm keen on not
>> replacing the Hadoop user, but how do I handle ticket renewal?
>>
>> Thanks very much,
>>
>> James
>>
>


Re: Kerberos Ticket Renewal (when not updating Hadoop user)

2019-06-13 Thread James Srinivasan
I think you mean more things to worry about (e.g. race conditions)!

Will definitely be useful when I get round to PRs for Accumulo NiFi
processors.

On Thu, 13 Jun 2019, 19:36 Jeff,  wrote:

> James,
>
> No worries!  At least you now have another point of reference. :)
>
> On Thu, Jun 13, 2019 at 1:11 PM James Srinivasan <
> james.sriniva...@gmail.com> wrote:
>
>> Err, my bad - meant to send this to the Accumulo list!
>>
>> Sorry!
>>
>> On Thu, 13 Jun 2019, 18:07 Jeff,  wrote:
>>
>>> Hello James,
>>>
>>> For our Hadoop processors, we generally don't do any explicit
>>> relogins/TGT renewal.  It's handled implicitly by the Hadoop libs.  PR 2360
>>> [1] is the primary change-set to allow this in NiFi, and several NiFI JIRAs
>>> (mainly NIFI-3472 [2]) are referenced in that pull request if you are
>>> interested in doing further reading.
>>>
>>> HiveConnectionPool (across various Hive versions) is the only component
>>> that comes to mind where we explicitly try to do a relogin.  NIFI-5134 [3]
>>> contains more information on that.
>>>
>>> Hope this information helps!
>>>
>>> [1] https://github.com/apache/nifi/pull/2360
>>> [2] https://issues.apache.org/jira/browse/NIFI-3472
>>> [3] https://issues.apache.org/jira/browse/NIFI-5134
>>>
>>> On Wed, Jun 12, 2019 at 4:06 PM James Srinivasan <
>>> james.sriniva...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm finally getting around to fixing up some deprecation issues with
>>>> our use of Kerberos with Accumulo and GeoMesa
>>>> (https://github.com/locationtech/geomesa/). Because I didn't know any
>>>> better at the time, I used the KerberosToken ctor specifying that the
>>>> Hadoop user should be replaced. Combined with a thread to periodically
>>>> renew the ticket (calling
>>>> UserGroupInformation.getCurrentUser.checkTGTAndReloginFromKeytab()),
>>>> this has worked nicely for us.
>>>>
>>>> However, there are some unfortunate side effects of updating the
>>>> Hadoop user - for instance, subsequent HDFS operations use the new
>>>> user, who may not have the same permissions as the original user in a
>>>> Zeppelin-type notebook environment. Plus the replaceCurrentUser param
>>>> is deprecated and removed in Accumulo 2.0. So I'm keen on not
>>>> replacing the Hadoop user, but how do I handle ticket renewal?
>>>>
>>>> Thanks very much,
>>>>
>>>> James
>>>>
>>>


Re: Custom Processor Upgrade

2019-08-15 Thread James Srinivasan
I find strace (or procmon for Windows) very handy to debug such resource
loading issues.
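
e.g. something along these lines to see exactly which files get probed when
the processor initialises (the pid is made up):

  strace -f -e trace=open,openat -p 12345 2>&1 | grep application-context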

On Thu, 15 Aug 2019, 19:02 Bryan Bende,  wrote:

> I was making sure you didn't have any code that was dependent on the
> internal structure of how the NARs are unpacked.
>
> I can't explain why it can't find the application-context.xml since I
> don't have access to your code, but I don't see why that would be
> related to moving to NAR_INF from META_INF, nothing should really be
> relying on that structure.
>
> On Thu, Aug 15, 2019 at 1:27 PM Bimal Mehta  wrote:
> >
> > It's inside one of the jars within the NAR_INF folder. Does it need to be
> somewhere else?
> > Also I think we extended the AbstractNiFiProcessor from the custom Kylo
> processor while migrating it, as the Kylo processor was not working as is in our
> environment. Will check that and have it packaged in the processors module.
> >
> > On Thu, Aug 15, 2019 at 1:50 AM Bryan Bende  wrote:
> >>
> >> Where is application-context.xml in your NAR?
> >>
> >> And how are you trying to load it in
> com.thinkbiganalytics.nifi.processor.AbstractNiFiProcessor ?
> >>
> >> I would expect it to be packaged into the jar that contains your
> processors, most likely in src/main/resources of the processors module
> which then ends up at the root of the jar.
> >>
> >> On Wed, Aug 14, 2019 at 5:36 PM Bimal Mehta  wrote:
> >>>
> >>> Ahh, seems like a Spring error.
> >>> Is it to do with the upgraded Jetty server?
> >>>
> >>> Caused by:
> org.springframework.beans.factory.BeanDefinitionStoreException: Unexpected
> exception parsing XML document from class path resource
> [application-context.xml]; nested exception is
> org.springframework.beans.FatalBeanException: Class
> [org.springframework.context.config.ContextNamespaceHandler] for namespace [
> http://www.springframework.org/schema/context] does not implement the
> [org.springframework.beans.factory.xml.NamespaceHandler] interface
> >>> at
> org.springframework.beans.factory.xml.XmlBeanDefinitionReader.doLoadBeanDefinitions(XmlBeanDefinitionReader.java:414)
> >>> at
> org.springframework.beans.factory.xml.XmlBeanDefinitionReader.loadBeanDefinitions(XmlBeanDefinitionReader.java:336)
> >>> at
> org.springframework.beans.factory.xml.XmlBeanDefinitionReader.loadBeanDefinitions(XmlBeanDefinitionReader.java:304)
> >>> at
> org.springframework.beans.factory.support.AbstractBeanDefinitionReader.loadBeanDefinitions(AbstractBeanDefinitionReader.java:181)
> >>> at
> org.springframework.beans.factory.support.AbstractBeanDefinitionReader.loadBeanDefinitions(AbstractBeanDefinitionReader.java:217)
> >>> at
> org.springframework.beans.factory.support.AbstractBeanDefinitionReader.loadBeanDefinitions(AbstractBeanDefinitionReader.java:188)
> >>> at
> org.springframework.beans.factory.support.AbstractBeanDefinitionReader.loadBeanDefinitions(AbstractBeanDefinitionReader.java:252)
> >>> at
> org.springframework.context.support.AbstractXmlApplicationContext.loadBeanDefinitions(AbstractXmlApplicationContext.java:127)
> >>> at
> org.springframework.context.support.AbstractXmlApplicationContext.loadBeanDefinitions(AbstractXmlApplicationContext.java:93)
> >>> at
> org.springframework.context.support.AbstractRefreshableApplicationContext.refreshBeanFactory(AbstractRefreshableApplicationContext.java:129)
> >>> at
> org.springframework.context.support.AbstractApplicationContext.obtainFreshBeanFactory(AbstractApplicationContext.java:609)
> >>> at
> org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:510)
> >>> at
> org.springframework.context.support.ClassPathXmlApplicationContext.<init>(ClassPathXmlApplicationContext.java:139)
> >>> at
> org.springframework.context.support.ClassPathXmlApplicationContext.<init>(ClassPathXmlApplicationContext.java:83)
> >>> at
> com.thinkbiganalytics.nifi.processor.AbstractNiFiProcessor.init(AbstractNiFiProcessor.java:48)
> >>> at
> org.apache.nifi.processor.AbstractSessionFactoryProcessor.initialize(AbstractSessionFactoryProcessor.java:63)
> >>> at
> org.apache.nifi.controller.ExtensionBuilder.createLoggableProcessor(ExtensionBuilder.java:421)
> >>> ... 50 common frames omitted
> >>> Caused by: org.springframework.beans.FatalBeanException: Class
> [org.springframework.context.config.ContextNamespaceHandler] for namespace [
> http://www.springframework.org/schema/context] does not implement the
> [org.springframework.beans.factory.xml.NamespaceHandler] interface
> >>> at
> org.springframework.beans.factory.xml.DefaultNamespaceHandlerResolver.resolve(DefaultNamespaceHandlerResolver.java:128)
> >>> at
> org.springframework.beans.factory.xml.BeanDefinitionParserDelegate.parseCustomElement(BeanDefinitionParserDelegate.java:1406)
> >>> at
> org.springframework.beans.factory.xml.BeanDefinitionParserDelegate.parseCustomElement(BeanDefinitionParserDelegate.java:1401)
> >>> at
> org.springframework.beans.factory.xml.DefaultBeanDefinitionDocumentReader.parseBeanDefinitions(DefaultBeanDefiniti

Re: Regarding setting up multiple DistributedMapCacheServer controller service

2020-12-16 Thread James Srinivasan
If you are running on a cluster, you might want to consider an
alternative such as HBase_2_ClientMapCacheService, otherwise the node
running the DistributedMapCacheServer becomes a single point of failure.

(others: please correct me if I am wrong, this was the case a while
back when I moved to using the HBase server for our systems)

On Wed, 16 Dec 2020 at 16:13, sanjeet rath  wrote:
>
> Thanks Mark for clarifying.
>
> On Wed, 16 Dec 2020, 9:20 pm Mark Payne,  wrote:
>>
>> Sanjeet,
>>
>> You can certainly setup multiple instances of the DistributedMapCacheServer. 
>> I think the point that the article was trying to get at is probably that 
>> adding a second DistributedMapCacheClient does not necessitate adding a 
>> second server. Multiple clients can certainly use the same server.
>>
>> That said, there may be benefits to having multiple servers. Specifically, 
>> for DetectDuplicate, there may be some things to consider. Because the 
>> server is configured with a max number of elements to add, if you have two 
>> flows, and Flow A processes 1 million FlowFiles per hour, and Flow B 
>> processes 100 FlowFiles per hour, you will almost certainly want two 
>> different servers. That’s because you could have a FlowFile come into Flow 
>> B, not a duplicate. Then Flow A fills up the cache with 10,000 FlowFiles of 
>> its own. Then a duplicate comes into Flow B, but the cache doesn’t know 
>> about it because Flow A has already filled the cache. So in that case, it 
>> would help to have two. The only downside is that now you have to manage two
>> different Controller Services (generally not a problem) and ensure that you 
>> have firewalls opened, etc. to access it.
>>
>> Thanks
>> -Mark
>>
>> On Dec 16, 2020, at 10:37 AM, sanjeet rath  wrote:
>>
>> Hi All,
>>
>> Hope you are well.
>> I need one clarification regarding the DistributedMapCacheServer controller
>> service.
>> Our setup is that, on the same cluster, 2 teams are working in 2 different PGs.
>> Now both teams are using the DetectDuplicate processor, for which they need a
>> DistributedMapCacheClient.
>>
>> My question is: should I set up 2 different DistributedMapCacheServers on 2
>> different ports, or should I use 1
>> DistributedMapCacheServer with one port (let's say the 4557 default), with that
>> port used by both teams (both PGs)?
>>
>> I have gone through previous internet articles and community discussions, where
>> it is mentioned that the DistributedMapCacheServer should be set up only once per
>> cluster with one port, and that multiple DMC clients can access this port.
>>
>> Please advise whether there is any restriction on setting up multiple
>> DistributedMapCacheServers in a cluster.
>>
>> Thank you in advance,
>> Sanjeet
>>
>>
>>


Processor classpath

2017-07-04 Thread James Srinivasan
Hi,

I'm developing a processor which needs to read some of its config from
the classpath. Reading the docs etc., NiFi's classpath is a little
funky - where's the best (least worst?) location for such files? I
note that the HDFS processors can read their config (core-site.xml
etc) from the classpath, but I can't find where that actually is.

Thanks in advance,

James


Re: Processor classpath

2017-07-05 Thread James Srinivasan
Thanks, I ended up stracing the NiFi process to see where it was
actually trying to load my config file from. Helpfully, the list of
locations included the NiFi conf directory, which is just perfect for
me.

I don't want to keep the config file (core-site.xml for Hadoop) in my
NAR itself since it will obviously change between different
deployments. I was inspired by:

https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-extension-utils/nifi-hadoop-utils/src/main/java/org/apache/nifi/processors/hadoop/AbstractHadoopProcessor.java#L69-L70

Which says if that property isn't set, then NiFi searches the
classpath for the necessary file.
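
In code, that pattern boils down to roughly the following (a sketch, not the
exact GeoMesa code; the Hadoop client does the classpath search when no
resource is added explicitly):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class HadoopConfigLoader {
    // If an explicit path is configured, add it as a resource; otherwise the
    // Hadoop client falls back to whatever core-site.xml it finds on the classpath.
    public static Configuration load(String configuredPath) {
        Configuration config = new Configuration();
        if (configuredPath != null && !configuredPath.trim().isEmpty()) {
            config.addResource(new Path(configuredPath.trim()));
        }
        return config;
    }
}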

Thanks again,

James

On 5 July 2017 at 00:07, Jeremy Dyer  wrote:
> Hi James,
>
> NiFi doesn't do anything too crazy from what Java itself does. The "funky"
> part is the isolated classpath loaders. Really since your building a custom
> processor I would also assume your building a custom NAR. If that is the
> case you can read that resource from the "nar" just like you would any "jar"
> in java. The HDFS processor your referenced loads an external file that
> isn't bundled in the actual classpath at all. If that is the functionality
> you want I suggest following the template from the HDFS processors as you
> stated. Otherwise just carryon as usual and pretend its a regular java
> application.
>
> On Tue, Jul 4, 2017 at 4:31 PM, James Srinivasan
>  wrote:
>>
>> Hi,
>>
>> I'm developing a processor which needs to read some of its config from
>> the classpath. Reading the docs etc., NiFi's classpath is a little
>> funky - where's the best (least worst?) location for such files? I
>> note that the HDFS processors can read their config (core-site.xml
>> etc) from the classpath, but I can't find where that actually is.
>>
>> Thanks in advance,
>>
>> James
>
>


Re: Processor classpath

2017-07-07 Thread James Srinivasan
Hi Bryan,

That all makes sense...except this is my own NiFi processor, not one
of the built-in Hadoop ones:

https://github.com/jrs53/geomesa-nifi

So I'm still confused as to why core-site.xml is picked up from NiFi's
conf directory, but very happy that it is!

On 5 July 2017 at 21:55, Bryan Bende  wrote:
> James,
>
> Just to clarify... the conf directory of NiFi is not on the classpath
> of processors.
>
> The behavior with core-site.xml is because the Hadoop client used by
> NiFi provides this behavior, and not NiFi itself. Basically if you
> create a new Hadoop Configuration object and you don't provide it with
> a list of resources, it will find any that are on the classpath, which
> in this case would be any that are bundled in JARs inside the given
> NAR, or anything in NiFI's lib directory. Putting stuff in NiFi's lib
> directory makes it visible to every processor and is generally
> dangerous, which is why the existing Hadoop processors have a property
> for the path to the core-site.xml file and then load it like so:
>
> String resource = ... // get the value of the property where the path is
> Configuration config = new Configuration();
> config.addResource(new Path(resource.trim()));
>
> Hope that helps.
>
> -Bryan
>
>
>
> On Wed, Jul 5, 2017 at 4:15 PM, James Srinivasan
>  wrote:
>> Thanks, I ended up stracing the NiFi process to see where it was
>> actually trying to load my config file from. Helpfully, the list of
>> locations included the NiFi conf directory, which is just perfect for
>> me.
>>
>> I don't want to keep the config file (core-site.xml for Hadoop) in my
>> NAR itself since it will obviously change between different
>> deployments. I was inspired by:
>>
>> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-extension-utils/nifi-hadoop-utils/src/main/java/org/apache/nifi/processors/hadoop/AbstractHadoopProcessor.java#L69-L70
>>
>> Which says if that property isn't set, then NiFi searches the
>> classpath for the necessary file.
>>
>> Thanks again,
>>
>> James
>>
>> On 5 July 2017 at 00:07, Jeremy Dyer  wrote:
>>> Hi James,
>>>
>>> NiFi doesn't do anything too crazy from what Java itself does. The "funky"
>>> part is the isolated classpath loaders. Really since your building a custom
>>> processor I would also assume your building a custom NAR. If that is the
>>> case you can read that resource from the "nar" just like you would any "jar"
>>> in java. The HDFS processor your referenced loads an external file that
>>> isn't bundled in the actual classpath at all. If that is the functionality
>>> you want I suggest following the template from the HDFS processors as you
>>> stated. Otherwise just carryon as usual and pretend its a regular java
>>> application.
>>>
>>> On Tue, Jul 4, 2017 at 4:31 PM, James Srinivasan
>>>  wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm developing a processor which needs to read some of its config from
>>>> the classpath. Reading the docs etc., NiFi's classpath is a little
>>>> funky - where's the best (least worst?) location for such files? I
>>>> note that the HDFS processors can read their config (core-site.xml
>>>> etc) from the classpath, but I can't find where that actually is.
>>>>
>>>> Thanks in advance,
>>>>
>>>> James
>>>
>>>


Re: Processor classpath

2017-07-07 Thread James Srinivasan
Here's where strace tells me my processor is looking for core-site.xml:

[nifi@nifi nifi-1.3.0]$ strace -p 22189 -f 2>&1 | grep core-site.xml
[pid 22284] stat("/opt/nifi/nifi-1.3.0/conf/core-site.xml",  
[pid 22284] 
stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/nifi-jetty-bundle-1.3.0.nar-unpacked/core-site.xml",
0x7fa0831f1df0) = -1 ENOENT (No such file or directory)
[pid 22284] 
stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/nifi-jetty-bundle-1.3.0.nar-unpacked/META-INF/bundled-dependencies/core-site.xml",
 
[pid 22284] 
stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/geomesa-nifi-nar-0.9.0-SNAPSHOT.nar-unpacked/core-site.xml",
 
[pid 22284] 
stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/geomesa-nifi-nar-0.9.0-SNAPSHOT.nar-unpacked/META-INF/bundled-dependencies/core-site.xml",
 
[pid 22284] stat("/opt/nifi/nifi-1.3.0/conf/core-site.xml",  
[pid 22284] 
stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/nifi-jetty-bundle-1.3.0.nar-unpacked/core-site.xml",
 
[pid 22284] 
stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/nifi-jetty-bundle-1.3.0.nar-unpacked/META-INF/bundled-dependencies/core-site.xml",
0x7fa0831f2600) = -1 ENOENT (No such file or directory)
[pid 22284] 
stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/geomesa-nifi-nar-0.9.0-SNAPSHOT.nar-unpacked/core-site.xml",
0x7fa0831f2680) = -1 ENOENT (No such file or directory)
[pid 22284] 
stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/geomesa-nifi-nar-0.9.0-SNAPSHOT.nar-unpacked/META-INF/bundled-dependencies/core-site.xml",
0x7fa0831f2680) = -1 ENOENT (No such file or directory)
[pid 22284] stat("/opt/nifi/nifi-1.3.0/conf/core-site.xml",
0x7fa0831f23c0) = -1 ENOENT (No such file or directory)
[pid 22284] 
stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/nifi-jetty-bundle-1.3.0.nar-unpacked/core-site.xml",
0x7fa0831f2440) = -1 ENOENT (No such file or directory)
[pid 22284] 
stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/nifi-jetty-bundle-1.3.0.nar-unpacked/META-INF/bundled-dependencies/core-site.xml",
0x7fa0831f2440) = -1 ENOENT (No such file or directory)
[pid 22284] 
stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/geomesa-nifi-nar-0.9.0-SNAPSHOT.nar-unpacked/core-site.xml",
0x7fa0831f24c0) = -1 ENOENT (No such file or directory)
[pid 22284] 
stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/geomesa-nifi-nar-0.9.0-SNAPSHOT.nar-unpacked/META-INF/bundled-dependencies/core-site.xml",
0x7fa0831f24c0) = -1 ENOENT (No such file or directory)

I've also separately dumped the classpath and confirmed that the NiFi
conf directory isn't on it. The Hadoop docs tell me it searches the
classpath for core-site.xml [1], so I'm confused as to why it works,
but happy that it does work!

[1] 
http://hadoop.apache.org/docs/r2.6.4/api/org/apache/hadoop/conf/Configuration.html
as one example

On 7 July 2017 at 16:17, Bryan Bende  wrote:
> Hi James,
>
> I'm saying that it should not find a core-site.xml in NiFi's conf
> directory because the conf directory is not on the classpath of
> processors.
>
> Are you saying you have tested putting it there and believe it is finding it?
>
> -Bryan
>
> On Fri, Jul 7, 2017 at 11:08 AM, James Srinivasan
>  wrote:
>> Hi Bryan,
>>
>> That all makes sense...except this is my own NiFi processor, not one
>> of the built-in Hadoop ones:
>>
>> https://github.com/jrs53/geomesa-nifi
>>
>> So I'm still confused as to why core-site.xml is picked up from NiFi's
>> conf directory, but very happy that it is!
>>
>> On 5 July 2017 at 21:55, Bryan Bende  wrote:
>>> James,
>>>
>>> Just to clarify... the conf directory of NiFi is not on the classpath
>>> of processors.
>>>
>>> The behavior with core-site.xml is because the Hadoop client used by
>>> NiFi provides this behavior, and not NiFi itself. Basically if you
>>> create a new Hadoop Configuration object and you don't provide it with
>>> a list of resources, it will find any that are on the classpath, which
>>> in this case would be any that are bundled in JARs inside the given
>>> NAR, or anything in NiFI's lib directory. Putting stuff in NiFi's lib
>>> directory makes it visible to every processor and is generally
>>> dangerous, which is why the existing Hadoop processors have a property
>>> for the path to the core-site.xml file and then load it like so:
>>>
>>> String resource = ... // get the value of the property where the path is
>>> Configuration config = new Configuration();
>>> config.addResource(new Path(resource.trim()));
>>>
>>> Hope that helps.
>>>
>>> -Bryan
>>>
>>>
>>&

Re: Processor classpath

2017-07-07 Thread James Srinivasan
No worries, while "it's working but I don't entirely know why"
satisfies me immediately, it does keep niggling at the back of my mind
(not least in case it randomly stops working!).
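
For the record, a quick way to see the chain Bryan describes below from inside
a processor is simply to walk it (a debugging sketch only):

public class ClassLoaderChain {
    // Print the classloader hierarchy for a given class; in a processor you would
    // pass getClass() and log via getLogger() rather than System.out.
    public static void print(Class<?> clazz) {
        ClassLoader cl = clazz.getClassLoader();
        while (cl != null) {
            System.out.println("ClassLoader: " + cl);
            cl = cl.getParent();
        }
        System.out.println("ClassLoader: bootstrap (null parent)");
    }
}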

Thanks for all the help,

James

On 7 July 2017 at 17:44, Bryan Bende  wrote:
> Actually after looking at the code, I stand corrected...
>
> The conf directory is actually on the class path of the system class
> loader when NiFi is launched, so this makes sense.
>
> The classpath hierarchy is something like...
>
> - System Class Loader (conf dir and jar files in lib)
> - Jetty NAR Class Loader with a parent of System Class Loader
> - Geomesa NAR Class Loader with a parent of Jetty NAR Class Loader
>
> So when the code in your Gemoesa NAR looks for it, it will traverse up
> the chain to the system class loader and find it there.
>
> The classpath of the system class loader should be what you see in
> nifi-bootstrap.log for the command that started NiFi.
>
> Sorry for the confusion.
>
>
> On Fri, Jul 7, 2017 at 12:03 PM, James Srinivasan
>  wrote:
>> Here's where strace tells me my processor is looking for core-site.xml:
>>
>> [nifi@nifi nifi-1.3.0]$ strace -p 22189 -f 2>&1 | grep core-site.xml
>> [pid 22284] stat("/opt/nifi/nifi-1.3.0/conf/core-site.xml",  
>> [pid 22284] 
>> stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/nifi-jetty-bundle-1.3.0.nar-unpacked/core-site.xml",
>> 0x7fa0831f1df0) = -1 ENOENT (No such file or directory)
>> [pid 22284] 
>> stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/nifi-jetty-bundle-1.3.0.nar-unpacked/META-INF/bundled-dependencies/core-site.xml",
>>  
>> [pid 22284] 
>> stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/geomesa-nifi-nar-0.9.0-SNAPSHOT.nar-unpacked/core-site.xml",
>>  
>> [pid 22284] 
>> stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/geomesa-nifi-nar-0.9.0-SNAPSHOT.nar-unpacked/META-INF/bundled-dependencies/core-site.xml",
>>  
>> [pid 22284] stat("/opt/nifi/nifi-1.3.0/conf/core-site.xml",  
>> [pid 22284] 
>> stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/nifi-jetty-bundle-1.3.0.nar-unpacked/core-site.xml",
>>  
>> [pid 22284] 
>> stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/nifi-jetty-bundle-1.3.0.nar-unpacked/META-INF/bundled-dependencies/core-site.xml",
>> 0x7fa0831f2600) = -1 ENOENT (No such file or directory)
>> [pid 22284] 
>> stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/geomesa-nifi-nar-0.9.0-SNAPSHOT.nar-unpacked/core-site.xml",
>> 0x7fa0831f2680) = -1 ENOENT (No such file or directory)
>> [pid 22284] 
>> stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/geomesa-nifi-nar-0.9.0-SNAPSHOT.nar-unpacked/META-INF/bundled-dependencies/core-site.xml",
>> 0x7fa0831f2680) = -1 ENOENT (No such file or directory)
>> [pid 22284] stat("/opt/nifi/nifi-1.3.0/conf/core-site.xml",
>> 0x7fa0831f23c0) = -1 ENOENT (No such file or directory)
>> [pid 22284] 
>> stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/nifi-jetty-bundle-1.3.0.nar-unpacked/core-site.xml",
>> 0x7fa0831f2440) = -1 ENOENT (No such file or directory)
>> [pid 22284] 
>> stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/nifi-jetty-bundle-1.3.0.nar-unpacked/META-INF/bundled-dependencies/core-site.xml",
>> 0x7fa0831f2440) = -1 ENOENT (No such file or directory)
>> [pid 22284] 
>> stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/geomesa-nifi-nar-0.9.0-SNAPSHOT.nar-unpacked/core-site.xml",
>> 0x7fa0831f24c0) = -1 ENOENT (No such file or directory)
>> [pid 22284] 
>> stat("/opt/nifi/nifi-1.3.0/work/nar/extensions/geomesa-nifi-nar-0.9.0-SNAPSHOT.nar-unpacked/META-INF/bundled-dependencies/core-site.xml",
>> 0x7fa0831f24c0) = -1 ENOENT (No such file or directory)
>>
>> I've also separately dumped the classpath and confirmed that the NiFi
>> conf directory isn't on it. The Hadoop docs tell me it searches the
>> classpath for core-site.xml [1], so I'm confused as to why it works,
>> but happy that it does work!
>>
>> [1] 
>> http://hadoop.apache.org/docs/r2.6.4/api/org/apache/hadoop/conf/Configuration.html
>> as one example
>>
>> On 7 July 2017 at 16:17, Bryan Bende  wrote:
>>> Hi James,
>>>
>>> I'm saying that it should not find a core-site.xml in NiFi's conf
>>> directory because the conf directory is not on the classpath of
>>> processors.
>>>
>>> Are you saying you have tested putting it there and believe it is finding 
>>> it?
>>>
>>> -Bryan
>>>
>>>

Re: Getting untrusted proxy message while trying to setup secure NIFI cluster

2017-07-13 Thread James Srinivasan
Hi,

I found I had to add this to authorizations.xml for R & W, with
corresponding users.xml entries:







Still not entirely sure my secured cluster is fully set up correctly -
planning on writing up how we did it tho.
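
For reference, a representative version of those entries (made-up identifiers
and a placeholder node DN) is:

users.xml:
  <user identifier="node-1-uuid" identity="CN=nifi-node1.example.com, OU=NIFI"/>

authorizations.xml ("proxy user requests" policy, read and write):
  <policy identifier="proxy-r-uuid" resource="/proxy" action="R">
    <user identifier="node-1-uuid"/>
  </policy>
  <policy identifier="proxy-w-uuid" resource="/proxy" action="W">
    <user identifier="node-1-uuid"/>
  </policy>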

James


On 13 July 2017 at 13:47, mathes waran  wrote:
> I am using nifi V-1.3, and trying to setup 3 node secure NIFI cluster.
>
> I have added all the required properties, I can see nodes sending heartbeats
> in logs in all the nodes but on screen I'm getting Untrusted proxy message
> for all nodes. error screen shot attached.
>
> Error log getting as NiFiAuthenticationFilter Rejecting access to web api:
> Untrusted proxy CN=hostname
>
>  Find the nifi properties below:
> <authorizer>
>     <identifier>file-provider</identifier>
>     <class>org.apache.nifi.authorization.FileAuthorizer</class>
>     <property name="AuthorizationsFile">./conf/authorizations.xml</property>
>     <property name="Users File">./conf/users.xml</property>
>     <property name="Initial Admin Identity">mat...@example.com</property>
>     <property name="Node Identity 1">CN=no...@example.com, OU=NIFI</property>
>     <property name="Node Identity 2">CN=CN=no...@example.com, OU=NIFI</property>
>     <property name="Node Identity 3">CN=CN=no...@example.com, OU=NIFI</property>
> </authorizer>
>
>
> could you please tell if anybody overcomes it.
>
> Thanks,
> Matheswaran


Processor using Kerberos keytab auth - can't renew TGT

2017-07-14 Thread James Srinivasan
Hi all,

I have a NiFi processor which uses Kerberos keytab authentication to
write data to Accumulo. I have a separate thread which periodically
runs in order to try renewing my TGT
(UserGroupInformation.getCurrentUser.checkTGTAndReloginFromKeytab()).

This code works fine outside NiFi, but inside NiFi while the initial
login is fine, on subsequent attempts to check the TGT, the
UserGroupInformation class seems to think it is using ticket cache,
not keytab authentication (i.e.
UserGroupInformation.getCurrentUser.isFromKeytab is false).

I notice the Hadoop processors support some Kerberos authentication
options (I'm not yet using any of those processors, but would like to
in other flows). Could this be interacting badly with my code?

Thanks very much,

James


Re: Processor using Kerberos keytab auth - can't renew TGT

2017-07-14 Thread James Srinivasan
Hi Georg,

I am indeed using open-jdk8 on CentOS 7.3, but I'm not sure why my
standalone app is ok, whereas the same code in NiFi isn't. How did you
fix the JCE policies?

I'm guessing it is something to do with the shared
UserGroupInformation class. Which makes me wonder how (if) it will
work with multiple processors potentially using different keytabs. Am
wondering if this applies to me:

https://github.com/apache/nifi/blob/rel/nifi-1.3.0/nifi-nar-bundles/nifi-extension-utils/nifi-hadoop-utils/src/main/java/org/apache/nifi/hadoop/KerberosProperties.java#L32

Thanks,

James

On 14 July 2017 at 14:16, Georg Heiler  wrote:
> Hi Joe,
>
> we recently had a similar problem. For us it turned out that we are using
> the latest open-jdk8 which no longer is providing the JCE policies required
> for strong cryptography out of the box on cents 7.3.
>
> regards,
> Georg
>
> Joe Witt  schrieb am Fr., 14. Juli 2017 um 15:12 Uhr:
>>
>> James,
>>
>> I know Jeff Storck has recently been doing some work around
>> Kerberos/TGT renewal.  Hopefully he can share some of his
>> observations/work back on this thread soon.
>>
>> Thanks
>>
>> On Fri, Jul 14, 2017 at 8:48 AM, James Srinivasan
>>  wrote:
>> > Hi all,
>> >
>> > I have a NiFi processor which uses Kerberos keytab authentication to
>> > write data to Accumulo. I have a separate thread which periodically
>> > runs in order to try renewing my TGT
>> > (UserGroupInformation.getCurrentUser.checkTGTAndReloginFromKeytab()).
>> >
>> > This code works fine outside NiFi, but inside NiFi while the initial
>> > login is fine, on subsequent attempts to check the TGT, the
>> > UserGroupInformation class seems to think it is using ticket cache,
>> > not keytab authentication (i.e.
>> > UserGroupInformation.getCurrentUser.isFromKeytab is false).
>> >
>> > I notice the Hadoop processors support some Kerberos authentication
>> > options (I'm not yet using any of those processors, but would like to
>> > in other flows). Could this be interacting badly with my code?
>> >
>> > Thanks very much,
>> >
>> > James


Re: Processor using Kerberos keytab auth - can't renew TGT

2017-07-14 Thread James Srinivasan
Hmm, so it seems updating the Hadoop version used by my processor from
2.6.0 to 2.7.3 has fixed the problem. Testing a little more just to
make sure...
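
For reference, the change was roughly the following version bump in the
processor's pom (the exact artifact depends on which Hadoop client modules the
bundle pulls in):

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.3</version>
</dependency>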

On 14 July 2017 at 14:48, Georg Heiler  wrote:
> We just applied the standard fix to enable the JCE extensions i.e. copied
> the files into the right place. I was on vacation last week but it looked
> like the fix we had been searching for for a while. We were still conducting
> some more testing to see if this actually fixed the problem.
>
> But without the fix we could observe your described problem on other long
> running services like HBase as well.
>
> James Srinivasan  schrieb am Fr., 14. Juli 2017
> um 15:36 Uhr:
>>
>> Hi Georg,
>>
>> I am indeed using open-jdk8 on CentOS 7.3, but I'm not sure why my
>> standalone app is ok, whereas the same code in NiFi isn't. How did you
>> fix the JCE policies?
>>
>> I'm guessing it is something to do with the shared
>> UserGroupInformation class. Which makes me wonder how (if) it will
>> work with multiple processors potentially using different keytabs. Am
>> wondering if this applies to me:
>>
>>
>> https://github.com/apache/nifi/blob/rel/nifi-1.3.0/nifi-nar-bundles/nifi-extension-utils/nifi-hadoop-utils/src/main/java/org/apache/nifi/hadoop/KerberosProperties.java#L32
>>
>> Thanks,
>>
>> James
>>
>> On 14 July 2017 at 14:16, Georg Heiler  wrote:
>> > Hi Joe,
>> >
>> > we recently had a similar problem. For us it turned out that we are
>> > using
>> > the latest open-jdk8 which no longer is providing the JCE policies
>> > required
>> > for strong cryptography out of the box on cents 7.3.
>> >
>> > regards,
>> > Georg
>> >
>> > Joe Witt  schrieb am Fr., 14. Juli 2017 um 15:12
>> > Uhr:
>> >>
>> >> James,
>> >>
>> >> I know Jeff Storck has recently been doing some work around
>> >> Kerberos/TGT renewal.  Hopefully he can share some of his
>> >> observations/work back on this thread soon.
>> >>
>> >> Thanks
>> >>
>> >> On Fri, Jul 14, 2017 at 8:48 AM, James Srinivasan
>> >>  wrote:
>> >> > Hi all,
>> >> >
>> >> > I have a NiFi processor which uses Kerberos keytab authentication to
>> >> > write data to Accumulo. I have a separate thread which periodically
>> >> > runs in order to try renewing my TGT
>> >> > (UserGroupInformation.getCurrentUser.checkTGTAndReloginFromKeytab()).
>> >> >
>> >> > This code works fine outside NiFi, but inside NiFi while the initial
>> >> > login is fine, on subsequent attempts to check the TGT, the
>> >> > UserGroupInformation class seems to think it is using ticket cache,
>> >> > not keytab authentication (i.e.
>> >> > UserGroupInformation.getCurrentUser.isFromKeytab is false).
>> >> >
>> >> > I notice the Hadoop processors support some Kerberos authentication
>> >> > options (I'm not yet using any of those processors, but would like to
>> >> > in other flows). Could this be interacting badly with my code?
>> >> >
>> >> > Thanks very much,
>> >> >
>> >> > James


Re: GetHTTP 403:Forbidden

2017-10-11 Thread James Srinivasan
Hi,

Doesn't sound like you have the certs set up correctly since
--insecure for curl skips certificate validation. I'm not aware of a
similar option for NiFi, but assuming you generated the certificate
yourself, searching for something like "java https self signed web"
should help. If the certificate was generated by a third party, then
make sure the appropriate intermediate and root certificates are also
in your store.
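
If the certs were issued by a third party, the usual route from curl-style PEM
files to the keystore/truststore that StandardSSLContextService expects is
something like this (paths and aliases are illustrative):

# client cert + key -> PKCS12 -> JKS keystore
openssl pkcs12 -export -in c_cert.pem -inkey c_key.pem -certfile ca.pem \
    -name nifi-client -out client.p12
keytool -importkeystore -srckeystore client.p12 -srcstoretype PKCS12 \
    -destkeystore keystore.jks -deststoretype JKS

# CA (and any intermediates) -> truststore
keytool -importcert -trustcacerts -alias server-ca -file ca.pem -keystore truststore.jks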

James

On 11 October 2017 at 20:37, pat  wrote:
> Greetings,
>
>   I am using GetHTTP to call
> https://xxx.mitre.org/xx/xx
>
> I have set up the StandardSSLContextService with my certs
>
> When GetHTTP runs I get the error:
>  [Timer-Driven Process Thread-6] o.a.nifi.processors.standard.GetHTTP
> GetHTTP[id=015e1160-a849-126f-0306-deadef9b45f3] received status code
> 403:Forbidden from https://xxx.mitre.org/xx/xx
>
> I can reach the site via firefox if I add a security exception.
> I can also reach the site with curl like:
> curl --noproxy "*" --insecure --cacert .ca.pem --cert .c_cert.pem --key
> ./c_key.pem   https://xxx.mitre.org/xx/xx
>
> I assume the problem is that I don't know how to have NIFI do something like
> the "--insecure" options.
> Is there a way to do this in NIFI (or a work around)
>
> thank you
>
>
>
> --
> Sent from: http://apache-nifi-users-list.2361937.n4.nabble.com/


Re: GetHTTP 403:Forbidden

2017-10-12 Thread James Srinivasan
I might've been wrong about the certificate issue, since the server is
returning a 403. What happens if you omit --insecure from the curl
command? Might the server be doing something funky e.g. looking at the
user agent string?

I suspect the equivalent of --insecure requires recompiling the
processor and is almost certainly not the right thing to do.

On 12 October 2017 at 13:13, Jones, Patrick L.  wrote:
> Thanks for the reply.
>
> I did not generate the certs myself, they were generated by a third 
> party.  As I said the only way I can access the site is with curl and 
> --insecure.  Does anyone know of a way in NIFI to do the equivalence of 
> "--insecure" ?  The curl that works is:
> curl --noproxy "*" --insecure --cacert ./ca.pem --cert ./c_cert.pem --key  
> ./c_key.pem   https://xxx.mitre.org/xx/xx
>
> thank you
>
>
> Pat
>
>
> -Original Message-
> From: James Srinivasan [mailto:james.sriniva...@gmail.com]
> Sent: Wednesday, October 11, 2017 4:15 PM
> To: users@nifi.apache.org
> Subject: Re: GetHTTP 403:Forbidden
>
> Hi,
>
> Doesn't sound like you have the certs set up correctly since --insecure for 
> curl skips certificate validation. I'm not aware of a similar option for 
> NiFi, but assuming you generated the certificate yourself, searching for 
> something like "java https self signed web"
> should help. If the certificate was generated by a third party, then make 
> sure the appropriate intermediate and root certificates are also in your 
> store.
>
> James
>
> On 11 October 2017 at 20:37, pat  wrote:
>> Greetings,
>>
>>   I am using GetHTTP to call
>> https://xxx.mitre.org/xx/xx
>>
>> I have set up the StandardSSLContextService with my certs
>>
>> When GetHTTP runs I get the error:
>>  [Timer-Driven Process Thread-6] o.a.nifi.processors.standard.GetHTTP
>> GetHTTP[id=015e1160-a849-126f-0306-deadef9b45f3] received status code
>> 403:Forbidden from https://xxx.mitre.org/xx/xx
>>
>> I can reach the site via firefox if I add a security exception.
>> I can also reach the site with curl like:
>> curl --noproxy "*" --insecure --cacert .ca.pem --cert .c_cert.pem --key
>> ./c_key.pem   https://xxx.mitre.org/xx/xx
>>
>> I assume the problem is that I don't know how to have NIFI do
>> something like the "--insecure" options.
>> Is there a way to do this in NIFI (or a work around)
>>
>> thank you
>>
>>
>>
>> --
>> Sent from: http://apache-nifi-users-list.2361937.n4.nabble.com/


Incorrect PublishKafka_0_10 documentation?

2017-11-07 Thread James Srinivasan
I've been struggling to get NiFi working with Kerberos authenticated
Kafka. According to the docs, the "Kerberos Service Name" property
specifies:

"The Kerberos principal name that Kafka runs as. This can be defined
either in Kafka's JAAS config or in Kafka's config. Corresponds to
Kafka's 'security.protocol' property. It is ignored unless one of the
SASL options of the <Security Protocol> are selected."

First off, it doesn't correspond to Kafka's security.protocol property
- it corresponds to the JAAS serviceName property. Second, I'm not
sure it is a Kerberos principal name - in my (HDP) install, it is set
to "kafka", and using the full Kerberos principal name
("kafka@MYDOMAIN.LOCAL") doesn't work. I would submit a PR, but I'm
not 100% sure about the second bit.

Long story short, for my install setting this to "kafka" worked, plus
setting "Kerberos Principal" and "Kerberos Keytab" to suitable things,
and "Security Protocol" to "SASL_PLAINTEXT". In our environment, we
enforce explicit topic creation so having done that and granted
producer and consumer access to the correct users, everything works
nicely.
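
Concretely, the combination that worked here looked roughly like this (the
principal and keytab are placeholders for whatever your environment uses):

PublishKafka_0_10 properties:
  Security Protocol     : SASL_PLAINTEXT
  Kerberos Service Name : kafka
  Kerberos Principal    : nifi@MYDOMAIN.LOCAL
  Kerberos Keytab       : /etc/security/keytabs/nifi.keytab

Equivalent JAAS entry, if you configure it via a jaas.conf referenced from
bootstrap.conf instead:
  KafkaClient {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    keyTab="/etc/security/keytabs/nifi.keytab"
    principal="nifi@MYDOMAIN.LOCAL"
    serviceName="kafka";
  };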

James


NiFi cluster with DistributedMapCacheServer/Client

2018-04-13 Thread James Srinivasan
Hi all,

Is there a recommended way to set up a
DistributedMapCacheServer/Client on a cluster, ideally with some
amount of HA (NiFi 1.3.0)? I'm using a shared persistence directory,
and when adding and enabling the controller it seems to start on my
primary node (but not the other two - status keeps saying "enabling"
rather than "enabled"). Adding the DistributedMapCacheClientService is
harder, because I have to specify the host it runs on. Setting it to
the current primary node works, but presumably won't fail over?

I guess the proper solution is to use the HBase versions (or even
implement my own Accumulo one for our cluster)

Thanks very much,

James


Re: NiFi cluster with DistributedMapCacheServer/Client

2018-04-13 Thread James Srinivasan
Thanks, I might try moving to the HBase implementation anyway because:

1) It is already in NiFi 1.3
2) We already have HBase installed (but unused) on our cluster
3) There doesn't seem to be a limit to the number of cache entries.
For our use case (avoiding downloading the same file multiple times)
it was always a bit icky to set the number of cache entries to
something that should be "big enough"

Thanks again,

James

On 13 April 2018 at 20:24, Joe Witt  wrote:
> James,
>
> You have it right about the proper solution path..  I think we have a
> Redis one in there now too that might be interesting (not in 1.3.0
> perhaps but..).
>
> We offered a simple out of the box one early and to ensure the
> interfaces are right.  Since then the community has popped up some
> real/stronger implementations like you're mentioning.
>
> Thanks
>
> On Fri, Apr 13, 2018 at 7:14 PM, James Srinivasan
>  wrote:
>> Hi all,
>>
>> Is there a recommended way to set up a
>> DistributedMapCacheServer/Client on a cluster, ideally with some
>> amount of HA (NiFi 1.3.0)? I'm using a shared persistence directory,
>> and when adding and enabling the controller it seems to start on my
>> primary node (but not the other two - status keeps saying "enabling"
>> rather than "enabled"). Adding the DistributedMapCacheClientService is
>> harder, because I have to specify the host it runs on. Setting it to
>> the current primary node works, but presumably won't fail over?
>>
>> I guess the proper solution is to use the HBase versions (or even
>> implement my own Accumulo one for our cluster)
>>
>> Thanks very much,
>>
>> James


Re: Fetch Contents of HDFS Directory as a Part of a Larger Flow

2018-05-03 Thread James Srinivasan
We handle a similar situation using CTAS and then retrieve the resulting
data using webhdfs.
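
For what it's worth, the retrieval side is roughly (host, port and the CTAS
output path are placeholders):

# list the files Hive wrote for the CTAS table
curl -s "http://namenode.example.com:50070/webhdfs/v1/warehouse/tmp_extract_1234?op=LISTSTATUS"

# fetch one of them, following the redirect to the datanode
curl -s -L "http://namenode.example.com:50070/webhdfs/v1/warehouse/tmp_extract_1234/000000_0?op=OPEN"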

James

On Thu, 3 May 2018, 17:18 Bryan Bende,  wrote:

> The two step idea makes sense...
>
> If you did want to go with the OS call you would probably want
> ExecuteStreamCommand.
>
> On Thu, May 3, 2018 at 12:06 PM, Shawn Weeks 
> wrote:
> > I'm thinking about ways to do the operation in two steps where the first
> > request starts the process of generating the data and returns an uuid and
> > the second request can check on the status and download the file. Still
> have
> > to workout how to collect the output from the Hive table so I'll look at
> the
> > rest calls. Not sure of a good way to make an OS call as ExecuteProcess
> > doesn't support inputs either.
> >
> >
> > Thanks
> >
> > Shawn
> >
> > 
> > From: Bryan Bende 
> > Sent: Thursday, May 3, 2018 10:51:03 AM
> > To: users@nifi.apache.org
> > Subject: Re: Fetch Contents of HDFS Directory as a Part of a Larger Flow
> >
> > Another option would be if the Hadoop client was installed on the NiFi
> > node then you could use one of the script processors to make a call to
> > "hadoop fs -ls ...".
> >
> > If the response is so large that it requires heavy lifting of writing
> > out temp tables to HDFS and then fetching those files into NiFi, and
> > most likely merging to a single response flow file, is that really
> > expected to happen in the context of a single web request/response?
> >
> > On Thu, May 3, 2018 at 11:45 AM, Pierre Villard
> >  wrote:
> >> Hi Shawn,
> >>
> >> If you know the path of the files to retrieve in HDFS, you could use
> >> FetchHDFS processor.
> >> If you need to retrieve all the files within the directory created by
> >> Hive,
> >> I guess you could list the existing files calling the REST API of
> WebHDFS
> >> and then use the FetchHDFS processor.
> >>
> >> Not sure that's the best solution to your requirement though.
> >>
> >> Pierre
> >>
> >> 2018-05-03 17:35 GMT+02:00 Shawn Weeks :
> >>>
> >>> I'm building a rest service with the HTTP Request and Response
> Processors
> >>> to support data extracts from Hive. Since some of the extracts can be
> >>> quiet
> >>> large using the SelectHiveQL Processor isn't a performant option and
> >>> instead
> >>> I'm trying to use on demand Hive Temporary Tables to do the heavy
> lifting
> >>> via CTAS(Create Table as Select). Since GetHDFS doesn't support an
> >>> incoming
> >>> connection I'm trying to figure out another way to fetch the files Hive
> >>> creates and return them as a download in the web service. Has anyone
> else
> >>> worked out a good solution for fetching the contents of a directory
> from
> >>> HDFS as a part of larger flow?
> >>>
> >>>
> >>> Thanks
> >>>
> >>> Shawn
> >>
> >>
>


NiFi ExecuteScript vs multiple processors vs custom processor

2018-07-09 Thread James Srinivasan
Hi all,

I was wondering if there is any general guidance about when to use
ExecuteScript and when to use a chain of processors? For example, in
one application I am downloading a HTML index file, extracting the
links corresponding to more index pages of data per year, fetching
those pages, extracting some more links per month and then downloading
the results. I'm currently doing this with a bunch of NiFi processors
(about 20 in total), whereas I could replace them all by a single
fairly simple Python or Groovy script called by ExecuteScript.

In another application, I have written a custom processor but probably
could have written the same code in a script.

Any guidance on how to choose between the three options would be much
appreciated (yay for choices!)

Thanks very much,

James


NiFi protobuf processor

2018-08-16 Thread James Srinivasan
I've written a quick NiFi protobuf to json processor. Before I go through
my organisation's admin to contribute it, is it likely to be accepted?
Mostly because I'm lazy, it doesn't use the new record functionality and
only supports protobuf decoding.
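
For context, the core of it is little more than the stock
DynamicMessage/JsonFormat calls, roughly like this (descriptor handling is
simplified and all names are illustrative):

import java.io.FileInputStream;

import com.google.protobuf.DescriptorProtos.FileDescriptorProto;
import com.google.protobuf.DescriptorProtos.FileDescriptorSet;
import com.google.protobuf.Descriptors.Descriptor;
import com.google.protobuf.Descriptors.FileDescriptor;
import com.google.protobuf.DynamicMessage;
import com.google.protobuf.util.JsonFormat;

public class ProtoToJson {
    // Decode a payload against a descriptor set produced by
    // `protoc --descriptor_set_out=messages.desc`; assumes a single .proto with no imports.
    public static String decode(String descriptorSetPath, String messageType, byte[] payload) throws Exception {
        FileDescriptorSet set;
        try (FileInputStream in = new FileInputStream(descriptorSetPath)) {
            set = FileDescriptorSet.parseFrom(in);
        }
        FileDescriptorProto fdp = set.getFile(0);
        FileDescriptor fd = FileDescriptor.buildFrom(fdp, new FileDescriptor[0]);
        Descriptor type = fd.findMessageTypeByName(messageType);
        DynamicMessage message = DynamicMessage.parseFrom(type, payload);
        return JsonFormat.printer().print(message);
    }
}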

Thanks

James


Re: NiFi protobuf processor

2018-08-20 Thread James Srinivasan
Great idea - done, thanks for the suggestion.

James
On Thu, 16 Aug 2018 at 19:34, Joe Witt  wrote:
>
> James,
>
> I'd recommend you create or edit an existing JIRA with more of the
> details of the idea and how it would work.  Use that to gauge whether
> there is enough interest perhaps before you go through the
> organization work of getting it released.
>
> It not leveraging the record reader/writer approach probably would
> impact the excitement level but that could just be me assuming what it
> does and the use cases it would be for.
>
> Thanks
> On Thu, Aug 16, 2018 at 2:17 PM Otto Fowler  wrote:
> >
> > You should send this to the dev list
> >
> >
> > On August 16, 2018 at 13:22:29, James Srinivasan 
> > (james.sriniva...@gmail.com) wrote:
> >
> > I've written a quick NiFi protobuf to json processor. Before I go through 
> > my organisation's admin to contribute it, is it likely to be accepted? 
> > Mostly because I'm lazy, it doesn't use the new record functionality and 
> > only supports protobuf decoding.
> >
> > Thanks
> >
> > James


Re: Problem with the GetFile processor deleting my entire installation

2021-06-04 Thread James Srinivasan
You can use the two permissions detailed here

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.13.2/org.apache.nifi.processors.standard.GetFile/index.html
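
In short: with the default Keep Source File = false, GetFile needs read
permission on the file and write permission on the directory, and it deletes
what it reads. So either take write permission away from directories NiFi must
never clean up, or configure the processor defensively, e.g.:

  Input Directory  : /data/inbound   # a data directory, never the NiFi install itself
  Keep Source File : true            # copy without deleting the source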

On Fri, 4 Jun 2021, 21:11 Ruth, Thomas,  wrote:

> No matter how secure I make the cluster, what’s to prevent a user from:
>
>- Using GetFile to copy the NiFi configuration files and keystores
>into flows.
>- Creating flows which contain python scripts that utilize the
>keys/secrets in the nifi configuration files to decrypt the contents of the
>keystores.
>- Executing those flows to gain access to the private keys used to
>encrypt data in NiFi
>- Using that data to gain unauthorized access to the cluster to
>further compromise it, possibly by generating their own admin-level user
>certificates.
>
>
>
> I’m genuinely stumped here as to how to create a solution that is
> production ready and that prevents a complete system failure by a user
> accidentally supplying the wrong directory to a processor component or
> generally creating python scripts that execute as the nifi user, being able
> to basically use a NiFi server as their own script-running playground.
>
>
>
> Tom
>
>
>
> *From:* Mike Sofen 
> *Sent:* Friday, June 4, 2021 2:02 PM
> *To:* users@nifi.apache.org
> *Subject:* RE: Problem with the GetFile processor deleting my entire
> installation
>
>
>
> Setting aside the user auth/permissions issue (which sounds like, from
> your initial email, that you nailed perfectly with a secured cluster), I’ll
> address this accidental deletion issue using what I consider a general
> design pattern for Nifi (or any process that write-touches files):
>
>- I don’t allow physical file deletes (which happen with the default
>GetFiles settings)
>- I create an archive folder for files I’ve processed if I want/need
>to move them out of the processing folder
>- I always set the “leave files” flag
>- I use either the timestamp or filename tracker to eliminate repeated
>processing if I’m not moving the files
>- OR – I use Move File flag to the archive folder destination.
>- And I advise new users to always spec a small test folder to start,
>and to double-check that the remove files is turned off.
>
>
>
> In some cases, like when I’m processing text files in place in secure
> repo, I can’t move the files so I use the timestamp eval method.
>
>
>
> In others, like log files where I definitely want to archive them out of
> that processing folder, I use the move model.
>
>
>
> Don’t get discouraged about this twist in the road – you’ll find Nifi to
> be a truly exceptional product for pipeline automation, wildly more
> powerful/flexible/stable than any other product out there (I’ve used
> most).  I’ve used it IoT to DB to ML processing, document processing,
> metrology data processing, you name it.  In particular, the built data
> provenance and auditability will be valuable for your situation at Optum.
> All the best,
>
>
>
> Mike
>
>
>
> *From:* Ruth, Thomas 
> *Sent:* Friday, June 04, 2021 12:26 PM
> *To:* users@nifi.apache.org
> *Subject:* RE: Problem with the GetFile processor deleting my entire
> installation
>
>
>
> Is it recommended that after installing NiFi, that I then proceed to
> remove read permissions from all installation files and directories in
> order to protect them from removal by users? Will this present a problem
> running the software if it’s unable to read any files?
>
>
>
> So for fun, I loaded up a quick nifi container:
> docker run --name nifi -p 8080:8080 -d apache/nifi:latest
>
>
>
> Connected to localhost:8080/nifi
>
>
>
> Created a GetFile processor, with the directory /opt/nifi. Connected the
> success to a LogAttributes. Hit Run….
>
>
>
> Now I get this:
> *HTTP ERROR 404 /nifi/canvas.jsp*
>
> *URI:*
>
> /nifi/
>
> *STATUS:*
>
> 404
>
> *MESSAGE:*
>
> /nifi/canvas.jsp
>
> *SERVLET:*
>
> jsp
>
>
>
> This behavior shouldn’t be possible, in my opinion. It’s putting security
> in the hands of my developers. I really am looking for a solution to this
> issue.
>
>
>
> The help text says this:
> Creates FlowFiles from files in a directory. NiFi will ignore files it
> doesn't have at least read permissions for.
>
>
>
> So as you suggested, I removed the read permission recursively from all
> files in /opt/nifi. After doing this, nifi no longer starts.
>
>
>
> find /opt/nifi -print > /tmp/files.txt
>
> for i in `cat /tmp/files.txt`; do chmod a-r $i; done
>
>
>
> I also went to https://nvd.nist.gov/vuln-metrics/cvss/v3-calculator and
> attempted to calculate a CVSS score for this vulnerability. I ended up
> calculating a score of 8.0.
>
>
>
> Tom
>
>
>
> *From:* Russell Bateman 
> *Sent:* Friday, June 4, 2021 11:20 AM
> *To:* users@nifi.apache.org
> *Subject:* Re: Problem with the GetFile processor deleting my entire
> installation
>
>
>
> Oh, sorry, to finish the answer, yes, you do need to be *very* careful
> how you specify the In

Re: Error on nifi start

2022-12-13 Thread James Srinivasan
I think NiFi currently supports Java 8 or 11, not 17:

https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#system_requirements
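
One way to get a compatible runtime on that instance (package name assumes
Amazon Linux 2 with Corretto; the JAVA_HOME path may differ on your AMI):

sudo yum install -y java-11-amazon-corretto-headless
sudo alternatives --config java    # pick the Java 11 entry as the system default

# or leave the system default alone and point NiFi at it in bin/nifi-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-11-amazon-corretto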

On Tue, 13 Dec 2022, 12:45 James McMahon,  wrote:

> I am using an Ansible role from Ansible GALAXY that has been tested and
> validated up through Apache NiFi v1.14.0. I download and install 1.14.0.bin
> from the Apache NiFi archives fir this reason.
>
> I am using ansible to install on and AWS EC2 instance. My java version on
> this instance is:
>
> openjdk 17.0.5 2022-10-18 LTS
>
> OpenJDK Runtime Environment Corretto-17.0.5.8.1
>
> The install goes well. But when nifi attempts to start, it fails with the
> following error message. Is this error indicating a compatibility issue
> with the java installation on AWS? How should I proceed to get nifi to
> start?
>
> 2022-12-13 02:50:39,316 ERROR [main] org.apache.nifi.NiFi Failure to
> launch NiFi due to org.xerial.snappy.SnappyError:
> [FAILED_TO_LOAD_NATIVE_LIBRARY] Unable to make protected final java.lang.Class
> java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
> throws java.lang.ClassFormatError accessible: module java.base does not
> "opens java.lang" to unnamed module @31b289da
>
> org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] Unable to
> make protected final java.lang.Class
> java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
> throws java.lang.ClassFormatError accessible: module java.base does not
> "opens java.lang" to unnamed module @31b289da
>
> at
> org.xerial.snappy.SnappyLoader.injectSnappyNativeLoader(SnappyLoader.java:297)
>
> at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:227)
>
> at org.xerial.snappy.Snappy.(Snappy.java:48)
>
> at
> org.apache.nifi.processors.hive.PutHiveStreaming.(PutHiveStreaming.java:158)
>
> at java.base/java.lang.Class.forName0(Native Method)
>
> at java.base/java.lang.Class.forName(Class.java:467)
>
> at
> org.apache.nifi.nar.StandardExtensionDiscoveringManager.getClass(StandardExtensionDiscoveringManager.java:328)
>
> at
> org.apache.nifi.documentation.DocGenerator.documentConfigurableComponent(DocGenerator.java:100)
>
> at
> org.apache.nifi.documentation.DocGenerator.generate(DocGenerator.java:65)
>
> at
> org.apache.nifi.web.server.JettyServer.start(JettyServer.java:1126)
>
> at org.apache.nifi.NiFi.(NiFi.java:159)
>
> at org.apache.nifi.NiFi.(NiFi.java:71)
>
> at org.apache.nifi.NiFi.main(NiFi.java:303)
>
>
>
>
>
>
>


Re: UI SocketTimeoutException - heavy IO

2023-03-22 Thread James Srinivasan
Apologies in advance if I've got this completely wrong, but I recall seeing
that error when I forgot to increase the open file limit for a heavily loaded
install. It is most obvious via the UI, but the logs will also have error
messages about too many open files.
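
For reference, a quick check and the usual fix (assuming NiFi runs as a 'nifi'
user; the numbers are only examples):

# what the running NiFi JVM actually has
cat /proc/$(pgrep -f org.apache.nifi.NiFi | head -1)/limits | grep 'open files'

# raise it for the nifi user in /etc/security/limits.conf, then restart NiFi:
nifi  soft  nofile  50000
nifi  hard  nofile  50000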

On Wed, 22 Mar 2023, 16:49 Mark Payne,  wrote:

> OK. So changing the checkpoint internal to 300 seconds might help reduce
> IO a bit. But it will cause the repo to become much larger, and it will
> take much longer to startup whenever you restart NiFi.
>
> The variance in size between nodes is likely due to how recently it’s
> checkpointed. If it stays large like 31 GB while the other stay small, that
> would be interesting to know.
>
> Thanks
> -Mark
>
>
> On Mar 22, 2023, at 12:45 PM, Joe Obernberger <
> joseph.obernber...@gmail.com> wrote:
>
> Thanks for this Mark.  I'm not seeing any large attributes at the moment
> but will go through this and verify - but I did have one queue that was set
> to 100k instead of 10k.
> I set the nifi.cluster.node.connection.timeout to 30 seconds (up from 5)
> and the nifi.flowfile.repository.checkpoint.interval to 300 seconds (up
> from 20).
>
> While it's running the size of the flowfile repo varies (wildly?) on each
> of the nodes from 1.5G to over 30G.  Disk IO is still very high, but it's
> running now and I can use the UI.  Interestingly at this point the UI shows
> 677k files and 1.5G of flow.  But disk usage on the flowfile repo is 31G,
> 3.7G, and 2.6G on the 3 nodes.  I'd love to throw some SSDs at this
> problem.  I can add more nifi nodes.
>
> -Joe
> On 3/22/2023 11:08 AM, Mark Payne wrote:
>
> Joe,
>
> The errors noted are indicating that NiFi cannot communicate with
> registry. Either the registry is offline, NiFi’s Registry Client is not
> configured properly, there’s a firewall in the way, etc.
>
> A FlowFile repo of 35 GB is rather huge. This would imply one of 3 things:
> - You have a huge number of FlowFiles (doesn’t seem to be the case)
> - FlowFiles have a huge number of attributes
> or
> - FlowFiles have 1 or more huge attribute values.
>
> Typically, FlowFile attribute should be kept minimal and should never
> contain chunks of contents from the FlowFile content. Often when we see
> this type of behavior it’s due to using something like ExtractText or
> EvaluateJsonPath to put large blocks of content into attributes.
>
> And in this case, setting Backpressure Threshold above 10,000 is even more
> concerning, as it means even greater disk I/O.
>
> Thanks
> -Mark
>
>
> On Mar 22, 2023, at 11:01 AM, Joe Obernberger
>   wrote:
>
> Thank you Mark.  These are SATA drives - but there's no way for the
> flowfile repo to be on multiple spindles.  It's not huge - maybe 35G per
> node.
> I do see a lot of messages like this in the log:
>
> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
> o.a.nifi.groups.StandardProcessGroup Failed to synchronize
> StandardProcessGroup[identifier=861d3b27-aace-186d-bbb7-870c6fa65243,name=TIKA
> Handle Extract Metadata] with Flow Registry because could not retrieve
> version 1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in
> bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused
> (Connection refused)
> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
> o.a.nifi.groups.StandardProcessGroup Failed to synchronize
> StandardProcessGroup[identifier=bcc23c03-49ef-1e41-83cb-83f22630466d,name=WriteDB]
> with Flow Registry because could not retrieve version 2 of flow with
> identifier ff197063-af31-45df-9401-e9f8ba2e4b2b in bucket
> 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection
> refused)
> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
> o.a.nifi.groups.StandardProcessGroup Failed to synchronize
> StandardProcessGroup[identifier=bc913ff1-06b1-1b76-a548-7525a836560a,name=TIKA
> Handle Extract Metadata] with Flow Registry because could not retrieve
> version 1 of flow with identifier d64e72b5-16ea-4a87-af09-72c5bbcd82bf in
> bucket 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused
> (Connection refused)
> 2023-03-22 10:52:13,960 ERROR [Timer-Driven Process Thread-62]
> o.a.nifi.groups.StandardProcessGroup Failed to synchronize
> StandardProcessGroup[identifier=920c3600-2954-1c8e-b121-6d7d3d393de6,name=Save
> Binary Data] with Flow Registry because could not retrieve version 1 of
> flow with identifier 7a8c82be-1707-4e7d-a5e7-bb3825e0a38f in bucket
> 736a8f4b-19be-4c01-b2c3-901d9538c5ef due to: Connection refused (Connection
> refused)
>
> A clue?
>
> -joe
> On 3/22/2023 10:49 AM, Mark Payne wrote:
>
> Joe,
>
> 1.8 million FlowFiles is not a concern. But when you say “Should I reduce
> the queue sizes?” it makes me wonder if they’re all in a single queue?
> Generally, you should leave the backpressure threshold at the default
> 10,000 FlowFile max. Increasing this can lead to huge amounts of swapping,
> which will drastically reduce performance and increase disk util

Re: Configuring ExecuteStreamCommand on jar flowfiles

2023-12-03 Thread James Srinivasan
Since a jar file is mostly just a standard zip file, can you use a built-in
processor instead?
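
UnpackContent, for instance, should do it directly:

UnpackContent
  Packaging Format : zip   # a .jar is a zip archive, so each entry becomes its own FlowFile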

On Sun, 3 Dec 2023, 15:36 James McMahon,  wrote:

> I have a large volume of a wide variety of incoming data files. A subset
> of these are jar files. Can the ExecuteStreamCommand be configured to run
> the equivalent of
>
> jar -xf ${flowfile}
>
> and will that automatically direct each output file to a new flowfile, or
> does ESC need to be told to direct each output file from jar standard out
> to the Success path out of ESC?
>
> Thank you in advance for any assistance.
>