I suspect then that mapping the network drive on the NiFi cluster is out of the question, as is a standalone NiFi instance on it. Other options then: are these logs emitted by syslog? If so, use the ListenSyslog processor; if not, then I'm struggling. Perhaps you could spin up a lightweight machine/server/Docker container that you can map to the Philips network drive, then use a local NiFi instance as I suggested before?

There are many other ingestion processors which you could explore – I haven't used any others so can't help there, but the docs give a good run-down of them. Does the Philips network drive have an API to interact with it? If so, you could use the ExecuteProcess or ExecuteScript processor.

Conrad
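To make the ExecuteProcess idea above concrete: a minimal sketch of a script that ExecuteProcess could run on a schedule, assuming (purely hypothetically) that the storage system exposes an HTTP listing endpoint – the URL, JSON layout and field names below are invented, not a real Philips API. ExecuteProcess turns whatever a command writes to stdout into FlowFile content, so the script simply prints one line per file for downstream processors to work on.

    #!/usr/bin/env python
    # Hypothetical poller for use with NiFi's ExecuteProcess processor.
    # Whatever this script prints to stdout becomes the emitted FlowFile's content.
    import json
    import urllib2  # Python 2 / Jython stdlib; use urllib.request on Python 3

    # Assumption: the storage system offers some HTTP listing API.
    # This URL and the response layout are placeholders, not a real endpoint.
    LISTING_URL = "http://storage.example.local/api/files?since=24h"

    def main():
        listing = json.load(urllib2.urlopen(LISTING_URL, timeout=30))
        for entry in listing.get("files", []):
            # One tab-separated line per file: path and last-modified timestamp.
            print("%s\t%s" % (entry.get("path", ""), entry.get("modified", "")))

    if __name__ == "__main__":
        main()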
From: "Tripathi, Shiv Deepak" <shiv.deepak.tripa...@philips.com>
Reply-To: users@nifi.apache.org
Date: Tuesday, 24 May 2016 at 12:48
To: users@nifi.apache.org
Subject: RE: Use Case...Please help

"You want to move data from the Philips network (mapped drive) to HDFS (Amazon cloud) with a NiFi installation hosted on Amazon too?"

Yes (the NiFi installation is on Amazon only). When I say "added as a service in Hortonworks Hadoop cluster", I mean that I have created a Hadoop cluster on the cloud and installed NiFi on that cluster. The Philips network drive is a storage drive; I don't think NiFi can be installed there, as it is not a server or machine.

Thanks,
Deepak

From: Conrad Crampton [mailto:conrad.cramp...@secdata.com]
Sent: Tuesday, May 24, 2016 4:31 PM
To: users@nifi.apache.org
Subject: Re: Use Case...Please help

So, as far as I understand this: you want to move data from the Philips network (mapped drive) to HDFS (Amazon cloud) with a NiFi installation hosted on Amazon too? As I said before, to use ListFile or FetchFile the mapped drive has to be local to the NiFi server, and your NiFi is running on a remote cloud server (as you stated – I don't quite understand what you mean when you say "added as a service in Hortonworks Hadoop cluster").

If you need to get these log files from a remote machine, you could use an instance of NiFi running on that machine (where the Philips network data is generated), use ListFile -> FetchFile there, and then use a combination of Remote Process Groups/site-to-site communication, with an output port on the local instance and an input port on the cluster, to reach your clustered NiFi. This is something that has been recommended to me for a similar use case in a previous thread (I haven't tried it out yet though). Another alternative would be to use ListSFTP and set up an SFTP server on that machine.

Once you have picked up the files in the clustered NiFi, depending on what you want to do with the data in those files: SplitText (to make multiple FlowFiles, typically one per line), then ExtractText (to parse those lines for attribute data), then UpdateAttribute to process further, then probably MergeContent to bundle these lines into tar files, and finally PutHDFS to store them (probably using attribute data to partition appropriately). I've made a lot of assumptions again, as I am still not totally sure of what you want to do, but hopefully you have some pointers to move you forward.

Regards,
Conrad
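As a purely illustrative aside on that SplitText -> ExtractText -> UpdateAttribute -> MergeContent chain: the standalone Python sketch below mimics what those stages do to a log file (split into lines, pull fields out with a regex, bin lines by an extracted value). The line format, regex and field names are invented; in NiFi itself this is processor configuration (regular expressions and Expression Language), not code.

    import re
    from collections import defaultdict

    # Invented line format: "2016-05-24 12:48:01 LEVEL message ..."
    LINE_PATTERN = re.compile(
        r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\S+) (?P<level>\w+) (?P<msg>.*)$")

    def split_extract_merge(log_text):
        """Roughly: SplitText -> ExtractText -> MergeContent."""
        bundles = defaultdict(list)           # MergeContent: bin lines by an attribute
        for line in log_text.splitlines():    # SplitText: one FlowFile per line
            match = LINE_PATTERN.match(line)
            if not match:
                continue                      # unmatched lines would route elsewhere
            attrs = match.groupdict()         # ExtractText: regex groups become attributes
            bundles[attrs["date"]].append(line)   # e.g. a partition key for PutHDFS paths
        return bundles

    if __name__ == "__main__":
        sample = ("2016-05-24 12:48:01 INFO scanner started\n"
                  "2016-05-24 12:49:10 WARN retrying connection")
        for date, lines in sorted(split_extract_merge(sample).items()):
            print("%s: %d line(s)" % (date, len(lines)))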
From: "Tripathi, Shiv Deepak" <shiv.deepak.tripa...@philips.com>
Reply-To: users@nifi.apache.org
Date: Tuesday, 24 May 2016 at 11:10
To: users@nifi.apache.org
Subject: RE: Use Case...Please help

Thanks for the time you spent on my use case. I mounted one of my Windows drives, where my test input data resides, and it's working.

As of now I am not satisfied with that implementation and am trying to extend it. My real use case is:

Source location: Philips network (Screenshot 1: "SourceDir.jpg"). The highlighted drive is my network drive, which I have currently mapped to my machine. I want NiFi to pick files directly from there – that is, even if I don't map it to my machine, NiFi should still pick files up from the network drive when I specify the path.

NiFi installation: NiFi is installed on cloud EC2 instances and added as a service in the Hortonworks Hadoop cluster.

Destination: S3 and HDFS on the Amazon cloud.

Could you please assist me with the order in which I need to select processors and how to specify the path?

Thanks,
Deepak

From: Simon Elliston Ball [mailto:si...@simonellistonball.com]
Sent: Monday, May 23, 2016 6:25 PM
To: users@nifi.apache.org
Subject: Re: Use Case...Please help

Hi Deepak,

It looks like your flow is heading in the right kind of direction, so I suspect there's something about the path that isn't working out. One solution would be to use a mapped drive on your machine, which makes it a little simpler; however, it would be nice if we could get it working with the UNC path as well. Are you getting any validation messages on the ListFile processor, either on the bulletin board in NiFi or in the nifi-app.log file?

Note that you will have to be connected to the drive to ensure you have credentials, or have your NiFi user able to connect to that drive with its Windows credentials. There isn't currently a means to provide authentication per share in the processor, but NiFi should inherit the credential context of whichever user is running the NiFi process.

Hope that helps,
Simon
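A quick way to check Simon's point about credentials is to confirm that the account NiFi runs under can actually read the share before configuring ListFile. A minimal sketch, using the UNC path from this thread – run it as the same Windows account that runs the NiFi process:

    import os

    # The share discussed in this thread; ListFile inherits the credentials of
    # whichever account the NiFi process runs as, so test with that account.
    UNC_PATH = r"\\btc7n001\Ongoing-MR\MRI\Deepak"

    try:
        entries = os.listdir(UNC_PATH)
        print("Readable, %d entries, e.g.:" % len(entries))
        for name in entries[:10]:
            print("  " + name)
    except OSError as err:
        # Usually a permissions/credentials problem, or the share is unreachable
        # from this machine (which is the situation with NiFi hosted on EC2).
        print("Cannot read %s: %s" % (UNC_PATH, err))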
On 22 May 2016, at 18:40, Tripathi, Shiv Deepak <shiv.deepak.tripa...@philips.com> wrote:

Hi Mark,

In order to try out Apache NiFi, I downloaded the Hortonworks sandbox and installed Apache NiFi on it. It works fine in the scenario below.

Scenario 1: My input directory is in the local file system on HDP (screenshot "listfilelocaldir") and the output is on the HDFS file system. For all processors in the dataflow, please see the screenshot "HDP sandbox local to HDFS".

Scenario 2: Could you please tell me which processors, and in what order, I need to use if I want to send files from \\btc7n001\Ongoing-MR\MRI\Deepak (a password-enabled network drive mapped to my machine) to the HDP cluster created in VMplayer? It is not recognizing the input directory at all. Please see the screenshot "Usecaseinputdir.jpeg".

Please help me.

Thanks,
Deepak

From: Mark Payne [mailto:marka...@hotmail.com]
Sent: Monday, May 16, 2016 6:19 PM
To: users@nifi.apache.org
Subject: Re: Use Case...Please help

Deepak,

Yes, you should be able to do so.

Thanks,
-Mark

On May 16, 2016, at 8:44 AM, Tripathi, Shiv Deepak <shiv.deepak.tripa...@philips.com> wrote:

Thanks a lot, Mark. Looking forward to trying it out. If I understood correctly, I can drop the log-copying script and the staging machine and pull the logs directly from the repository. Please confirm.

Thanks,
Deepak

From: Mark Payne [mailto:marka...@hotmail.com]
Sent: Monday, May 16, 2016 5:06 PM
To: users@nifi.apache.org
Subject: Re: Use Case...Please help

Deepak,

Thanks for providing such a detailed description of your use case. I think NiFi would be an excellent tool to help you out here!

As I mentioned before, you would typically use ListFile -> FetchFile to pull the data in. Clearly, here, you want to be more selective about what you pull in. You can accomplish this by using a RouteOnAttribute processor, so you'd have something like: ListFile -> RouteOnAttribute -> FetchFile.

The RouteOnAttribute processor is very powerful and allows you to configure how to route each piece of data based on whatever attributes are available. The ListFile processor adds the following attributes to each piece of data that it pulls in:

- filename (name of the file)
- path (relative path of the file)
- absolute.path (absolute directory of the file)
- fs.owner (owner of the file)
- fs.group (group that the file belongs to)
- fs.lastModified (last modified date)
- fs.length (file length)
- fs.permissions (file permissions, such as rw-rw-r--)

From these, you can make all sorts of routing decisions based on name, timestamp, etc. You can choose to terminate data that does not meet your criteria. When you use FetchFile, you have the option of deleting the source file, moving it elsewhere, or leaving it as-is, so you wouldn't need to delete it if you don't want to. This is possible because ListFile keeps track of what has already been 'listed': it won't ingest duplicate data, but it will pick up new files (and if an existing file is modified, it will pick up the new version of the file).

You can then use UnpackContent if you want to unzip the data, or you can leave it zipped. After the FetchFile, you can also use a RouteOnAttribute processor to separate the XML from the log files and put them into different directories in HDFS.

Does this sound like it will provide you all that you need?

Thanks,
-Mark
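Purely as an illustration of the routing decisions described above (in NiFi itself these are Expression Language rules on RouteOnAttribute properties, not code), here is a sketch of the two filters from Deepak's description quoted below: the log*.zip / *.xml name patterns and an N-day age window. The attribute names mirror the ones ListFile sets; the 20-day threshold is arbitrary, and fs.lastModified is treated as epoch milliseconds only to keep the sketch simple – in the flow you would compare it with Expression Language date functions.

    import fnmatch
    import time

    MAX_AGE_DAYS = 20  # arbitrary; the use case mentions "last 20 days or last 50 days"

    def route(attributes):
        """Return a relationship name for one listed file, the way
        RouteOnAttribute would, using ListFile's attributes."""
        name = attributes["filename"]
        age_days = (time.time() * 1000 - attributes["fs.lastModified"]) / 86400000.0
        if age_days > MAX_AGE_DAYS:
            return "too_old"        # old/unmatched entries can simply be auto-terminated
        if fnmatch.fnmatch(name, "log*.zip"):
            return "logs"           # -> FetchFile -> PutHDFS (log directory)
        if name.endswith(".xml"):
            return "xml"            # -> FetchFile -> PutHDFS (xml directory)
        return "unmatched"

    if __name__ == "__main__":
        now_ms = time.time() * 1000
        print(route({"filename": "log20160516.zip", "fs.lastModified": now_ms}))   # logs
        print(route({"filename": "device.xml", "fs.lastModified": now_ms}))        # xml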
On May 16, 2016, at 3:06 AM, Tripathi, Shiv Deepak <shiv.deepak.tripa...@philips.com> wrote:

Hi Mark,

I am very happy to see the detailed reply, and I am very thankful to you. Explaining more about my use case below.

1 - Screenshot "Stagingdirectory_copiedfiles": The log-copying script copies the logs into the staging directory, which in my case is "D:\PaiValidation", and maintains multiple folders. These folders are nothing but device serial numbers. Each serial number has multiple log files and XML files, and each day a new log file arrives in this directory, as you can see. In the log-copy script we define how many days of logs we want. So let's say we pass 360: it will copy logs from the last 360 days, and since the script runs continuously, ten days after you first pass that configuration it will hold 360 days of logs (counted from when the parameter was passed) plus 10 more, growing day by day (370+). After pushing the files to the cluster we rename them, or rather create dummy files of 0 bytes, as you can see in the screenshot. We also pass one more parameter that specifies the device serial numbers whose logs we want, so that we do not take logs from all devices.

2 - Screenshot "Repository files": This is the actual repository from which we take the logs and copy them to the staging directory. These are incoming logs from the devices, and every serial number has multiple types of files, as you can see in the screenshot. We need only the log files matching the log************.zip pattern and the XML files; the rest we will not pick up. Also, we cannot delete these logs from the repository.

3 - HDFS directory: From the staging directory, Flume moves the files to our on-premise HDFS cluster (screenshot "HDFS1"). You can see two highlighted folders in this screenshot: one holds only log files, the other holds XML files. If you go back to "Stagingdirectory_copiedfiles" you will find the XML and log files under the same device serial number, whereas we store them separately in the cluster. Screenshot "hdfs2": logs are stored under the same directory structure as in staging, for both the XML files and the log files.

So if I want to accomplish the above goals, will NiFi be the best solution? And if I point NiFi directly at the repository to pull the logs, will it be able to do these few things:

1 - It should not copy duplicate logs, because we will be deleting logs from the destination.
2 - It should only copy the logs of the last 20 days, or last 50 days, or any number of days, and if new logs arrive in the directory each day it should pick those up too.
3 - It should not delete any logs from the source repository.
4 - It should copy the specified logs into one directory and the XML into another directory in HDFS.

In such a case we can remove the concept of the script. Hoping for the best.

Thanks,
Deepak

From: Mark Payne [mailto:marka...@hotmail.com]
Sent: Monday, May 16, 2016 1:25 AM
To: users@nifi.apache.org
Subject: Re: Use Case...Please help

Hi Deepak,

Certainly, this is something that you could use NiFi for. We often see people using NiFi to sync data from a directory on local disk to a directory in HDFS. This is typically accomplished by using a flow like: ListFile -> FetchFile -> PutHDFS.

You can then create a file in the source directory with the same name by using ReplaceText to set the content to nothing and then PutFile to write out the 0-byte content. So the flow would look like: ListFile -> FetchFile -> PutHDFS -> ReplaceText -> PutFile.

PutHDFS has a "Directory" property. If you set this value to "${path}" it will use the same directory structure that ListFile found the file in when it performed the listing. I.e., if you set ListFile to pull from /data/mydir with "Recurse Subdirectories" set to true, then any file found in /data/mydir will have a 'path' of './' and anything found in /data/mydir/subdir1 will have a path of './subdir1'. If you would rather have the fully qualified path (/data/mydir/subdir1), you would use "${absolute.path}" instead of "${path}".

One thing that I find curious about your scenario, though, is the concept of a 'log copy script' and then putting back a 0-byte file so that the script does not pick up the data again. Why not just use NiFi to pull directly from the source and avoid using a script altogether? The ListFile processor will keep track of what has been pulled in already, so it won't copy the data multiple times. But I may not be clear on this point. Is the "log repository" that you mention just a directory that NiFi could pull from, or is it some other sort of repository?

Thanks,
-Mark
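Outside NiFi, the "mirror the relative path, then leave a 0-byte marker with the same name" behaviour this thread keeps coming back to looks roughly like the sketch below. It is only an illustration of the logic: the directories are made up (the source path is Mark's example), and in the flow itself ${path}, PutHDFS, ReplaceText and PutFile do this work with no code at all.

    import os
    import shutil

    SOURCE_ROOT = "/data/mydir"        # the directory ListFile would watch (Mark's example path)
    MIRROR_ROOT = "/tmp/hdfs_mirror"   # made-up stand-in for the PutHDFS destination

    def mirror_and_mark(src_file):
        """Copy a file, preserving its path relative to SOURCE_ROOT (what the
        ${path} attribute gives PutHDFS), then truncate the source to a 0-byte
        dummy of the same name (what ReplaceText -> PutFile achieve)."""
        rel_path = os.path.relpath(src_file, SOURCE_ROOT)   # e.g. "subdir1/a.log"
        dest = os.path.join(MIRROR_ROOT, rel_path)
        dest_dir = os.path.dirname(dest)
        if not os.path.isdir(dest_dir):
            os.makedirs(dest_dir)
        shutil.copy2(src_file, dest)
        open(src_file, "w").close()                         # leave the 0-byte marker behind

    if __name__ == "__main__":
        for dirpath, _, filenames in os.walk(SOURCE_ROOT):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.getsize(path) > 0:               # skip existing 0-byte markers
                    mirror_and_mark(path)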
On May 15, 2016, at 3:23 PM, Tripathi, Shiv Deepak <shiv.deepak.tripa...@philips.com> wrote:

Hi,

Currently I am using Flume for data ingestion, and my use case is as follows:

Log repository --> log copy script --> staging directory for copied logs

The staging directory for copied logs has this folder structure:

Machine1log
  a.log
  b.log
Machine2log
  a.log
  b.log

Flume copies these logs and replicates the same structure in the HDFS cluster, beginning at /user/hdfs:

/user/hdfs/Machine1log
  a.log
  b.log
/user/hdfs/Machine2log
  a.log
  b.log

A 0-byte dummy file with the same name is then created, so that the script won't copy the same log again once it finds the 0-byte file already existing in the source directory.

Can we do the same things with Apache NiFi, keeping in mind two goals: the same folder structure in HDFS, and that after moving a file to HDFS it should create a 0-byte dummy file in the source directory?

Please help.

Thanks,
Deepak

With Best Regards,
Deepak Tripathi
Philips Innovation Campus
Bangalore-560045

[Attachments referenced in this thread: repository files.jpg, Stagingdirectory_copiedfiles.jpg, HDFS1.jpg, hdfs2.jpg, listfilelocaldir.JPG, HDP sandbox local to HDFS.JPG, Usecaseinputdir.JPG]