Re: NIFI Usage for Data Transformation

2018-10-31 Thread Mark Rachelski
Ameer,

Allow me to offer an opinion on this as a user, not as one of the awesome
folks in this group who built this very cool tool. They are likely to be
very enthusiastic.

In my experience, NiFi will not easily replace a custom tool that already
does a number of complex transforms. For example, we still use Pentaho DI
for some of our more elaborate ETL flows. NiFi does have some basic
capabilities for text manipulation, including:

   - Jolt Transformer - focused on JSON transforms
   - Regex processing - straight-up text search/replace (see the sketch
   after this list)
   - Some level of splitting and joining files
   - I am sure someone will point out that you can embed your own custom
   transformer by writing a NiFi processor (either native or scripted) or by
   invoking your own scripts from a NiFi processor. All of this is possible,
   although it does require the security of the NiFi server itself to be
   relaxed a bit so that the development team building these scripts can
   deploy them to the server running NiFi (especially if the scripts are
   external to the flow).
   - Furthermore, NiFi can be used to trigger other tools that do the
   heavy transforms, such as Spark jobs or other Hadoop-based transforms,
   if you have to manipulate large data sets.
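
As a taste of the regex side, here is a minimal ReplaceText configuration
that rewrites ISO dates (yyyy-mm-dd) into dd/mm/yyyy on each line of a text
flowfile. This is a sketch from memory, so double-check the property names
against your NiFi version:

ReplaceText
  Search Value          (\d{4})-(\d{2})-(\d{2})
  Replacement Value     $3/$2/$1
  Replacement Strategy  Regex Replace
  Evaluation Mode       Line-by-Line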

What you will likely find is that if your transforms require some level of
data joins inside the ETL pipeline (joins currently being done by your
custom scripts), NiFi will not be able to help without resorting to driving
them through a data engine of some type (be it a SQL engine, Hadoop,
etc.), as in the sketch below.
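
To make that concrete, a common workaround is to stage both data sets in a
database and let it do the join. The processors named here are real, but
the tables and the query are hypothetical placeholders:

PutDatabaseRecord (feed A -> table stage_a)  ---+
PutDatabaseRecord (feed B -> table stage_b)  ---+--> ExecuteSQL:
    SELECT a.id, a.payload, b.label
    FROM stage_a a JOIN stage_b b ON a.id = b.id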

But, admittedly guessing, I suspect the thing your enterprise groups are
going for is the very natural support for non-functional requirements that
NiFi (or any framework worth anything) will provide. That is, things like:

   - Monitoring
   - Consistent reporting
   - Provenance of data transforms and traceability
   - Packing a large number of flows into a few machines

With this, they can possibly host NiFi as a service and your teams simply
contribute flows.

At my company, we do not build our data ingest world exclusively around
NiFi, but we do use it pretty widely to get disparate data sets into our
data platform, and it helps a lot. It means the developers can focus on
writing good ingest flows instead of continually ensuring that every custom
script meets the non-functional requirements of the data platform.
Admittedly, any good framework will also help you address those
requirements with little repetition.

We also still keep some custom ETL jobs that are just not worth the effort
of porting to NiFi. We don't see those jobs as being technical debt. They
work and are not causing us any issues. But they had to be independently
developed to meet those non-functional requirements.

The best thing I can suggest is that you try it and see. As a long-time
imperative programmer, I found it a bit difficult to get my head wrapped
around the dynamic nature of building flows. But once you invest the effort
to learn, it becomes a pretty cool tool in your toolbox. Eventually,
writing flows becomes faster than writing imperative scripts, and the
testing cycle is significantly shorter.

Regards,
Mark.

On Thu, Nov 1, 2018 at 4:04 AM Ameer Mawia  wrote:

>
> We have a use case where we take data from a source (text data in CSV
> format), do transformation and manipulation of the textual records, and
> output the data in another (CSV) format. This is being done by a
> Java-based custom framework, written specifically for this
> *transformation* piece.
>
> Recently, as Apache NIFI is being adopted at the enterprise level by the
> organisation, we have been asked to try *Apache NIFI* and see if we can
> use it as a replacement for this custom tool.
>
> *My question is*:
>
>- How much leverage does *Apache NIFI* provide for flowfile *content*
>manipulation?
>
> I understand *NIFI* is good for creating data flow pipelines, but is it
> good for *extensive TEXT transformation* as well? So far I have not
> found an obvious way to achieve that.
>
> Appreciate the feedback.
>
> Thanks,
>
> --
> http://ca.linkedin.com/in/ameermawia
> Toronto, ON
>


Re: PostHTTP and SSL

2018-10-15 Thread Mark Rachelski
I am not in front of a NiFi server to check, so this is from memory.

The response code was 302... check whether you are following redirects,
which I think is a separate setting in the InvokeHTTP processor.
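
From memory (so treat the property names as approximate), an InvokeHTTP
configuration for your case might look roughly like the sketch below. Note
that the Authorization line is a dynamic property, which InvokeHTTP sends
as a request header:

InvokeHTTP
  HTTP Method          POST
  Remote URL           https://myserver.org/api/26/dataValueSets
  SSL Context Service  <your StandardSSLContextService>
  Follow Redirects     true
  Content-Type         application/xml
  Authorization        Basic YWRhbTcwNGE6VGhlYmlnMzZkNSQ=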

Mark

On Mon, Oct 15, 2018 at 1:04 PM Andy LoPresto 
wrote:

> Hi Adam,
>
> There are a couple of tasks I would suggest. First, ensure that you are
> passing the authorization header in the HTTP request. The SSLContextService
> allows you to verify the external service public certificate, but if that
> service requires an authorization token, you will still need to provide it
> via a header.
>
> Second, PostHTTP is an older processor, and I would recommend using
> InvokeHTTP as it is a more modern and robust processor and supports all the
> HTTP operations.
>
> Andy LoPresto
> alopre...@apache.org
> alopresto.apa...@gmail.com
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>
> On Oct 14, 2018, at 00:30, Adam Preston  wrote:
>
> Hi Everyone
>
> I am using the PostHTTP processor to POST up some XML I transformed. I
> have even configured the SSL Context and everything.
>
>
> Problem is I am getting this in the logs:
>
> 2018-10-12 22:59:18,669 INFO [NiFi Web Server-73]
> o.a.n.c.s.StandardProcessScheduler Starting
> PostHTTP[id=2c63900a-99b7-3c1d-50cd-5530e0fba5e0]
> 2018-10-12 22:59:18,675 INFO [Timer-Driven Process Thread-5]
> o.a.n.c.s.TimerDrivenSchedulingAgent Scheduled
> PostHTTP[id=2c63900a-99b7-3c1d-50cd-5530e0fba5e0] to run with 1 threads
> 2018-10-12 22:59:18,772 INFO [Flow Service Tasks Thread-1]
> o.a.nifi.controller.StandardFlowService Saved flow controller
> org.apache.nifi.controller.FlowController@74c6dbf2 // Another save
> pending = false
> 2018-10-12 22:59:20,415 ERROR [Timer-Driven Process Thread-4]
> o.a.nifi.processors.standard.PostHTTP
> PostHTTP[id=2c63900a-99b7-3c1d-50cd-5530e0fba5e0] Failed to Post
> StandardFlowFileRecord[uuid=1f8c6159-5212-4170-9bfe-563339a04566,claim=StandardContentClaim
> [resourceClaim=StandardResourceClaim[id=1539385102154-2, container=default,
> section=2], offset=0,
> length=2694469],offset=0,name=23134109039036,size=2694469] to
> https://myserver.org/api/26/dataValueSets: response code was 302:Found
>
>
> I am able to do this with the following curl command just fine:
> curl -H 'Authorization: Basic YWRhbTcwNGE6VGhlYmlnMzZkNSQ=' -H
> "Content-Type: application/xml" "https://myserver.org/api/26/dataValueSets"
> --data @output.xml
>
> Any thoughts on how I can diagnose this one?
>
> Thanks
>
>
>
>


Re: FetchS3 not fetching all objects?

2018-09-20 Thread Mark Rachelski
The ListS3 processor (which typically feeds FetchS3Object) is stateful: it
uses NiFi state storage to track the timestamp of the most recent object
listing pulled from the bucket. On subsequent runs, it will only pull
objects whose last-modified time is newer than the timestamp held in state
(the time of the last listing run).

Is this possibly contributing to your missing files?
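
If that is the cause and you need to re-pull older objects, you can clear
the stored state. From memory (menu labels may differ a bit in 1.3.0):

Stop the ListS3 processor -> right-click -> View state -> Clear state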

Mark.

On Fri, Sep 21, 2018 at 6:21 AM Vets, Laurens  wrote:

> Hello,
>
> I'm using NiFi to read an S3 bucket containing all our AWS CloudTrail
> logs. While debugging an issue, I noticed that not all objects are
> fetched or listed. Basically, some events which I can find manually by
> grepping the S3 files, I can't find in our Kibana dashboard. Is it
> therefore possible that there might be an issue with the S3 processors
> whereby they don't pick up all S3 objects?
>
> I'm using NiFi 1.3.0. While reading the release notes for the newer NiFi
> versions, I found https://issues.apache.org/jira/browse/NIFI-4876 and am
> wondering whether this might be related.
>
> Can anyone shed some light on this?
>
>


Re: How to start a flow with attributes from the state store?

2017-10-19 Thread Mark Rachelski
Thank you Bryan,

That should fit my purposes well.
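
For the archives, the flow I have in mind looks roughly like the sketch
below. This is from memory, with a hypothetical URL and attribute names, so
treat the property names as approximate:

GenerateFlowFile   (run schedule: 24 h; the flowfile is just a trigger)
  -> UpdateAttribute   with Store State = "Store state locally"
       since     = ${getStateValue('last_time'):replaceEmpty('2017-01-01T00:00:00Z')}
       last_time = ${now():format("yyyy-MM-dd'T'HH:mm:ss'Z'")}
  -> InvokeHTTP   with Remote URL = https://example.org/api/records?from=${since}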

BTW: That processor is not in the User Guide.

As a follow-on question, is there an easy way to ask NiFi for all
processors that can be used at the beginning of a flow? There is a lot of
other tagging done, but I spent a few hours last night googling for an
answer before posting this question to the mailing list.

Mark.

On Fri, Oct 20, 2017 at 9:33 AM Bryan Bende  wrote:

> Hi Mark,
>
> You can use GenerateFlowFile as the initial processor to trigger your flow.
>
> Make sure to change the run schedule appropriately otherwise you will get
> a lot of flow files generated.
>
> -Bryan
>
> On Thu, Oct 19, 2017 at 10:08 PM Mark Rachelski 
> wrote:
>
>> I have a scenario where I need to make an HTTP request to an API but
>> taking context into account from previous invocations. Specifically, one
>> query string parameter is a time where the API returns all records from
>> that time or later. Every day, I would issue a new request using the
>> previous time requested.
>>
>> I have worked out that I can store the last time requested in the state
>> store, and use an UpdateAttribute processor to retrieve it or initialize
>> it on the first run. I can then feed that into the InvokeHTTP processor
>> and build a dynamic URL from that attribute.
>>
>> But my main problem is that I don't know what beginning processor to use
>> in this flow. UpdateAttribute needs an inbound connection. And there are no
>> obvious 'dummy' beginning processors that I can find in the vast array. The
>> only thing I need from the beginning processor is the schedule tab.
>>
>> Any ideas on what my first processor in this flow should be?
>>
>> Thank you in advance for any help,
>> Mark.
>>
> --
> Sent from Gmail Mobile
>


How to start a flow with attributes from the state store?

2017-10-19 Thread Mark Rachelski
I have a scenario where I need to make an HTTP request to an API but taking
context into account from previous invocations. Specifically, one query
string parameter is a time where the API returns all records from that time
or later. Every day, I would issue a new request using the previous time
requested.

I have worked out that I can store the last time requested in the state
store, and use an UpdateAttribute processor to retrieve it or initialize it
on the first run. I can then feed that into the InvokeHTTP processor and
build a dynamic URL from that attribute.

But my main problem is that I don't know what beginning processor to use in
this flow. UpdateAttribute needs an inbound connection. And there are no
obvious 'dummy' beginning processors that I can find in the vast array. The
only thing I need from the beginning processor is the schedule tab.

Any ideas on what my first processor in this flow should be?

Thank you in advance for any help,
Mark.


Re: JoltTransformJSON with array of jsons

2017-10-12 Thread Mark Rachelski
Pierre,
I never noticed the Advanced UI button... thank you.

Chris,

I am very confident this can be done. I just don't have the time to show
you how right now.

From memory, the right side of your jolt spec needs to direct the values
into the same node... then it will automatically build up an array as
output. The Jolt syntax takes a while to get your head wrapped around.
Start by reading through the spec found here:
https://github.com/bazaarvoice/jolt/blob/master/jolt-core/src/main/java/com/bazaarvoice/jolt/Shiftr.java

I was able to ask for help on StackOverflow and this link is an example
that a very nice person helped with that is manipulating arrays both on the
input and output side.
https://stackoverflow.com/questions/45961341/simplifying-google-sheet-json-using-jolt
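
Untested, but from a quick read of the Shiftr docs above, something like
the spec below should do it: using "[&1]" on the right side keeps each
input element as its own element of an output array while flattening
custom_primitives into it. Run it through the demo site before trusting it:

[
  {
    "operation": "shift",
    "spec": {
      "*": {
        "timestamp_start": "[&1].timestamp_start",
        "timestamp_end": "[&1].timestamp_end",
        "custom_primitives": {
          "*": "[&2].&"
        }
      }
    }
  }
]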

Mark.

On Thu, Oct 12, 2017 at 5:28 PM Pierre Villard 
wrote:

> I won't be able to provide more help (I'm always doing really simple
> things with Jolt), sorry Chris.
> Maybe others can chime in and give some hints.
>
> Just as a comment Mark, in the Jolt processor configuration view, you have
> an "Advanced UI" button available. If you click on it, it'll open a UI
> allowing you to test your Jolt specification just as you can do on the web
> page you provided. Maybe you already noticed the Advanced UI and feel it's
> not providing enough features... but just wanted to say it in this thread
> in case you missed it.
>
> Pierre
>
> 2017-10-12 12:08 GMT+02:00 Chris Herssens :
>
>> I tried this, but I get only one JSON back, where each element has an
>> array of values. I need an array of JSONs as the result; the
>> transformation should be done on each JSON.
>>
>> On 12 Oct 2017, 12:01 p.m., "Mark Rachelski" wrote:
>>
>> Chris,
>>>
>>> This does not work because your Jolt Specification is saying you are
>>> looking for a top-level element called timestamp_start. Try something like
>>> this:
>>>
>>> [
>>>   {
>>>     "operation": "shift",
>>>     "spec": {
>>>       "*": {
>>>         "timestamp_start": "timestamp_start",
>>>         "timestamp_end": "timestamp_end",
>>>         "custom_primitives": {
>>>           "*": "&"
>>>         }
>>>       }
>>>     }
>>>   }
>>> ]
>>>
>>> to allow for the array.
>>>
>>> Also, I recommend you don't develop your jolt specification inside
>>> NiFi... NiFi is the runtime processor.
>>>
>>> From a development environment just use this site:
>>> http://jolt-demo.appspot.com/#inception where you can paste in your
>>> starting document and specification to find a spec that does what you want.
>>> Then after the spec works, you can put it into your JoltTransformJson
>>> processor.
>>>
>>> Mark.
>>>
>>> On Thu, Oct 12, 2017 at 4:52 PM Chris Herssens 
>>> wrote:
>>>
>>>> Hello Pierre,
>>>>
>>>> I want to flatten the json content. How do I do this?
>>>> Now I use the chain DSL with this jolt spec:
>>>>
>>>> [
>>>>   {
>>>>     "operation": "shift",
>>>>     "spec": {
>>>>       "timestamp_start": "timestamp_start",
>>>>       "timestamp_end": "timestamp_end",
>>>>       "custom_primitives": {
>>>>         "*": "&"
>>>>       }
>>>>     }
>>>>   }
>>>> ]
>>>>
>>>> this doesn't work with an array of JSONs, only with a single JSON
>>>>
>>>> Regards,
>>>>
>>>> Chris
>>>>
>>>>
>>>>
>>>> On Thu, Oct 12, 2017 at 11:26 AM, Pierre Villard <
>>>> pierre.villard...@gmail.com> wrote:
>>>>
>>>>> Hi Chris,
>>>>>
>>>>> Yes it is.
>>>>>
>>>>> For example, if I want to add a field in all JSON of my array - I can
>>>>> do (with default DSL) :
>>>>>
>>>>> {
>>>>>   "*": {
>>>>>     "mynewfield": "mynewvalue"
>>>>>   }
>>>>> }
>>>>>
>>>>> I'm clearly not a Jolt expert, but others can help you if you're
>>>>> trying to do something more complex.
>>>>>
>>>>> Pierre
>>>>>
>>>>>
>>>>>
>>>>> 2017-10-12 11:04 GMT+02:00 Chris Herssens :
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Is it possible to use the JoltTransformJson processor with an array
>>>>>> of JSONs as input?
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Chris
>>>>>>
>>>>>
>>>>>
>>>>
>


Re: JoltTransformJSON with array of jsons

2017-10-12 Thread Mark Rachelski
Chris,

This does not work because your Jolt Specification is saying you are
looking for a top-level element called timestamp_start. Try something like
this:

[
  {
    "operation": "shift",
    "spec": {
      "*": {
        "timestamp_start": "timestamp_start",
        "timestamp_end": "timestamp_end",
        "custom_primitives": {
          "*": "&"
        }
      }
    }
  }
]

to allow for the array.

Also, I recommend you don't develop your jolt specification inside NiFi...
NiFi is the runtime processor.

From a development environment, just use this site:
http://jolt-demo.appspot.com/#inception where you can paste in your
starting document and specification to find a spec that does what you
want. Then after the spec works, you can put it into your
JoltTransformJson processor.

Mark.

On Thu, Oct 12, 2017 at 4:52 PM Chris Herssens 
wrote:

> Hello Pierre,
>
> I want to flatten the json content. How do I do this?
> Now I use the chain DSL with this jolt spec:
>
> [
>   {
>     "operation": "shift",
>     "spec": {
>       "timestamp_start": "timestamp_start",
>       "timestamp_end": "timestamp_end",
>       "custom_primitives": {
>         "*": "&"
>       }
>     }
>   }
> ]
>
> this doesn't work with an array of JSONs, only with a single JSON
>
> Regards,
>
> Chris
>
>
>
> On Thu, Oct 12, 2017 at 11:26 AM, Pierre Villard <
> pierre.villard...@gmail.com> wrote:
>
>> Hi Chris,
>>
>> Yes it is.
>>
>> For example, if I want to add a field in all JSON of my array - I can do
>> (with default DSL) :
>>
>> {
>>   "*": {
>>     "mynewfield": "mynewvalue"
>>   }
>> }
>>
>> I'm clearly not a Jolt expert, but others can help you if you're trying
>> to do something more complex.
>>
>> Pierre
>>
>>
>>
>> 2017-10-12 11:04 GMT+02:00 Chris Herssens :
>>
>>> Hello,
>>>
>>> Is it possible to use the JoltTransformJson processor with an array of
>>> JSONs as input?
>>>
>>> Regards,
>>>
>>> Chris
>>>
>>
>>
>