Re: NiFi best practices to manage big flowfiles

2017-04-21 Thread Mark Payne
Simone,

There is a Feature Proposal that was put together on our wiki at [1] that 
proposes a way to have a FlowFile refer to content that lives elsewhere outside 
of the content repo itself. I think this is what you're getting at. It's a 
great idea, but I don't know that any progress has yet been made on it.

If you're interested in developing this, we would certainly be more than
happy to work with you on bringing it to fruition.

Thanks
-Mark


[1] https://cwiki.apache.org/confluence/display/NIFI/External+FlowFile+content



Re: NiFi best practices to manage big flowfiles

2017-04-21 Thread Andrew Grande
Let me ask you this: do all those CLI processing steps change the file
format, content, etc.? If so, NiFi is not doing anything that you aren't
already doing. For example, unpacking a file requires space for both the
original and the decompressed file.

You can use ListFile without moving any files into NiFi. Each FlowFile it
emits carries the full file path in its attributes, which you can pass
along to your tool invocations.
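
As a rough sketch of that idea (plain Python, outside NiFi): ListFile sets
`absolute.path` (the directory) and `filename` on each FlowFile, so the full
path can be rebuilt and handed to a CLI tool without touching the content.
The function name, the work directory, the target SRS, and the overview
levels below are illustrative assumptions, not part of the flow discussed
in this thread.

```python
import posixpath


def build_gdal_commands(attributes, work_dir, target_srs="EPSG:4326"):
    """Build argv lists for gdalwarp and gdaladdo from FlowFile-style attributes.

    `attributes` mimics what ListFile puts on a FlowFile: 'absolute.path'
    is the directory and 'filename' is the file name. `work_dir` and
    `target_srs` are hypothetical parameters for this sketch.
    """
    source = posixpath.join(attributes["absolute.path"], attributes["filename"])
    stem, ext = posixpath.splitext(attributes["filename"])
    warped = posixpath.join(work_dir, stem + "_warped" + ext)
    # gdalwarp takes the source and destination paths as positional arguments.
    warp_cmd = ["gdalwarp", "-t_srs", target_srs, source, warped]
    # gdaladdo adds overviews to the given file at the listed levels.
    overview_cmd = ["gdaladdo", "-r", "average", warped, "2", "4", "8", "16"]
    return warp_cmd, overview_cmd


attrs = {"absolute.path": "/data/incoming/", "filename": "scene_001.tif"}
warp, overviews = build_gdal_commands(attrs, "/data/work")
print(" ".join(warp))
print(" ".join(overviews))
```

In a flow, the same path can be written in NiFi Expression Language as
`${absolute.path}/${filename}`, which is the default FetchFile uses to
locate the file it should fetch.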

HTH,
Andrew

Re: NiFi best practices to manage big flowfiles

2017-04-21 Thread Simone Giannecchini
Dear Andrew,
I am working with Damiano on this, so let me first thank you for your
pointers.

The use case is as follows:

- a satellite acquisition is placed on a shared file system. It can be
significant in size, e.g. 10 GB
- it has to be pulled through a chain of operations out of a larger
DAG, where the elements of the sequence are decided depending on the
data itself
- we will surely create a number of intermediate files as we are going
to use standard CLI tools for the processing
- the resulting file will be placed again in a shared file system to
be served by a cluster of mapping servers to generate maps on the fly

We are getting thousands of these files per day, hence we are trying to
minimize file move operations.

If you are still awake, here is the point: can I avoid bringing the
original file into the content repository without having to customize
too many parts of NiFi, or are we stretching NiFi too far from its
intended usage patterns?

Thanks for your patience.


Regards,
Simone Giannecchini
==
GeoServer Professional Services from the experts!
Visit http://goo.gl/it488V for more information.
==
Ing. Simone Giannecchini
@simogeo
Founder/Director

GeoSolutions S.A.S.
Via di Montramito 3/A
55054  Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob:   +39  333 8128928

http://www.geo-solutions.it
http://twitter.com/geosolutions_it

---
AVVERTENZE AI SENSI DEL D.Lgs. 196/2003
Le informazioni contenute in questo messaggio di posta elettronica e/o
nel/i file/s allegato/i sono da considerarsi strettamente riservate.
Il loro utilizzo è consentito esclusivamente al destinatario del
messaggio, per le finalità indicate nel messaggio stesso. Qualora
riceviate questo messaggio senza esserne il destinatario, Vi preghiamo
cortesemente di darcene notizia via e-mail e di procedere alla
distruzione del messaggio stesso, cancellandolo dal Vostro sistema.
Conservare il messaggio stesso, divulgarlo anche in parte,
distribuirlo ad altri soggetti, copiarlo, od utilizzarlo per finalità
diverse, costituisce comportamento contrario ai principi dettati dal
D.Lgs. 196/2003.

The information in this message and/or attachments, is intended solely
for the attention and use of the named addressee(s) and may be
confidential or proprietary in nature or covered by the provisions of
privacy act (Legislative Decree June, 30 2003, no.196 - Italy's New
Data Protection Code).Any use not in accord with its purpose, any
disclosure, reproduction, copying, distribution, or either
dissemination, either whole or partial, is strictly forbidden except
previous formal approval of the named addressee(s). If you are not the
intended recipient, please contact immediately the sender by
telephone, fax or e-mail and delete the information in this message
that has been received in error. The sender does not give any warranty
or accept liability as the content, accuracy or completeness of sent
messages and accepts no responsibility  for changes made after they
were sent or for other risks which arise as a result of e-mail
transmission, viruses, etc.



Re: NiFi best practices to manage big flowfiles

2017-04-21 Thread Andrew Grande
Hi,

First, there won't be multiple copies of a file within NiFi. If you pass
the content around and change only the attributes, not the content itself,
each FlowFile merely holds a reference to the same content, nothing more.
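
To make the reference idea concrete, here is a toy model in Python. It is
only an illustration: the `ContentClaim` and `FlowFile` classes and the
`update_attributes` helper are made up for this sketch and are not NiFi's
actual classes or API.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class ContentClaim:
    """Toy stand-in for a pointer into the content repository."""
    claim_id: str
    size_bytes: int


@dataclass(frozen=True)
class FlowFile:
    """Toy FlowFile: a small attribute map plus a reference to the content."""
    claim: ContentClaim
    attributes: dict = field(default_factory=dict)


def update_attributes(flowfile, **changes):
    """Return a new FlowFile with changed attributes and the same content claim."""
    merged = {**flowfile.attributes, **changes}
    return replace(flowfile, attributes=merged)


claim = ContentClaim(claim_id="claim-1", size_bytes=6 * 1024**3)
original = FlowFile(claim, {"filename": "scene_001.tif"})
routed = update_attributes(original, stage="warp")

# Both FlowFiles point at the very same claim object; nothing was copied.
print(routed.claim is original.claim)  # True
print(routed.attributes)
```

Only the small attribute map is duplicated here; the 6 GB payload the claim
stands for is shared by both records.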

You need to decide whether you want processed files deleted from the
source directory; that is what GetFile does. You might want to look into
ListFile/FetchFile instead, since ListFile maintains internal state about
the files it has already listed.
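
The "list and remember" idea can be sketched in a few lines of Python. This
is a simplified illustration, not how ListFile is implemented: the
`list_new_files` helper and the in-memory `seen` set are assumptions of the
sketch, whereas a real processor would persist its state.

```python
import os
import tempfile


def list_new_files(directory, seen):
    """Return paths of files not listed before; the file content is never read.

    `seen` is a set of (path, modification time) pairs that the caller keeps
    between calls, so a file is reported again only if it was modified.
    """
    with os.scandir(directory) as entries:
        ordered = sorted(entries, key=lambda e: e.name)
    new_paths = []
    for entry in ordered:
        if not entry.is_file():
            continue
        key = (entry.path, entry.stat().st_mtime_ns)
        if key not in seen:
            seen.add(key)
            new_paths.append(entry.path)
    return new_paths


with tempfile.TemporaryDirectory() as incoming:
    seen = set()
    open(os.path.join(incoming, "a.tif"), "wb").close()
    first = list_new_files(incoming, seen)
    open(os.path.join(incoming, "b.tif"), "wb").close()
    second = list_new_files(incoming, seen)
    third = list_new_files(incoming, seen)

print([os.path.basename(p) for p in first])
print([os.path.basename(p) for p in second])
print(third)
```

Each call reports only what has appeared since the previous one, and the
source files stay where they are.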

Assuming you want to delete the file from the original location, you can
use PutFile in your flow to write it to the new working directory and
connect the success relationship to ExecuteStreamCommand.

Andrew


NiFi best practices to manage big flowfiles

2017-04-21 Thread damiano.giampa...@geo-solutions.it
Hi list,

I'm a NiFi newbie and I'm trying to figure out the best way to use it as a
batch ingestion system for satellite imagery stored as raster files.
The files are pushed to the file system by an external system, and then
they must be processed and published through WMS protocols.
I drafted the flow using the available NiFi processors. In summary, I used:

- GetFile and PutFile processors to watch for incoming files to process
- UpdateAttribute to manage the location of the incoming files and the
intermediate processing products
- ExecuteStreamCommand to call the gdalwarp and gdaladdo command line
utilities to do the geospatial processing needed (http://www.gdal.org/)

The issue with my use case is that the flowfiles represent big raster
files (1 to 6 GB), and I would like to minimize the number of copies of
that resource.

I used the GetFile processor to watch a file system folder for incoming
files.
Once a new file is found, it is imported into NiFi's internal content
repository, so I can't reference it by its absolute file system path
anymore, since it has been transformed into a flowfile (did I misunderstand
something here?).
So I have to copy it again somewhere else on the file system to process it,
because the geospatial processing utilities I have to use require the
absolute path of the files to process.

There may be a solution that better addresses the design of this flow; for
example, I could watch not for the real file but for a txt file that
describes its absolute file system path.
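
The pointer-file idea could look roughly like the Python sketch below. The
file layout (a small text file whose first non-empty line is the raster's
absolute path) and the `resolve_pointer` helper are assumptions made for
illustration; nothing in NiFi prescribes this format.

```python
import os
import tempfile


def resolve_pointer(pointer_path):
    """Read a small text file and return the absolute raster path it names."""
    with open(pointer_path, encoding="utf-8") as handle:
        for line in handle:
            target = line.strip()
            if target:
                break
        else:
            # No non-empty line was found in the pointer file.
            raise ValueError(f"{pointer_path} is empty")
    if not os.path.isabs(target):
        raise ValueError(f"expected an absolute path, got {target!r}")
    return target


with tempfile.TemporaryDirectory() as incoming:
    pointer = os.path.join(incoming, "scene_001.txt")
    with open(pointer, "w", encoding="utf-8") as handle:
        handle.write("/data/rasters/scene_001.tif\n")
    raster_path = resolve_pointer(pointer)

print(raster_path)
```

Only the few bytes of the pointer file would ever need to enter the flow;
the raster itself stays on the shared file system.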

Anyway, I was wondering whether it is possible to configure the GetFile
processor to use only the absolute path of the file found as the flowfile
payload, but I guess that in that case I would have to write my own GetFile
processor (the same could also apply to other Getxxx processors).

Does anyone have some hints to suggest? Am I on the right path?
I would like to be sure that I am not overlooking some NiFi concept/feature
that would allow me to better manage this use case.

I hope I have been clear enough; anything you can share would be extremely
appreciated!

Best regards,
Damiano
-- 


==
GeoServer Professional Services from the experts!
Visit http://goo.gl/it488V for more information.
==

Damiano Giampaoli
Software Engineer


GeoSolutions S.A.S.
Via di Montramito 3/A
55054  Massarosa (LU)
Italy
phone: +39 0584 962313
fax:     +39 0584 1660272
mob:   +39 333 8128928

http://www.geo-solutions.it
http://twitter.com/geosolutions_it
