Re: How does Nifi ingest large files?

2016-10-27 Thread Jeremy Farbota
Indeed.

I went ahead and configured my dev cluster to use a RAM disk for the content and
flowfile repositories and switched back to FileSystemRepository and
WriteAheadFlowFileRepository, respectively. As long as the content/provenance
archive is off, I'm good with respect to compliance. Performance seems great so
far today; I'll report back if I hit issues. I'm using CentOS 7 (tmpfs mounted
via /etc/fstab).
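
For anyone wanting to replicate this, the fstab entries look roughly like the
following. The mount points and sizes here are illustrative, not my exact values:

```
# /etc/fstab -- illustrative tmpfs entries; adjust paths and size= for your install
tmpfs  /opt/nifi/content_repository   tmpfs  rw,size=8G,mode=0700  0  0
tmpfs  /opt/nifi/flowfile_repository  tmpfs  rw,size=2G,mode=0700  0  0
```

With those mounted, nifi.properties can keep pointing the repositories at the
same directories; the persistent implementations simply write to RAM-backed
storage.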

Thanks, Joe!

On Thu, Oct 27, 2016 at 8:42 PM, Andy LoPresto  wrote:

> I think Jeremy is using Volatile specifically because he does *not* want
> that data ever persisted to disk for compliance purposes.


-- 
Jeremy Farbota
Software Engineer, Data
jfarb...@payoff.com • (217) 898-8110


Re: How does Nifi ingest large files?

2016-10-27 Thread Andy LoPresto
I think Jeremy is using Volatile specifically because he does *not* want that 
data ever persisted to disk for compliance purposes.

Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Oct 27, 2016, at 8:59 AM, Joe Witt  wrote:
> 
> I should add that if you're comfortable with that sort of volatile behavior a 
> better path to consider is to setup a RAM-Disk and just run a persistent 
> content repository on that.  It will survive process restarts, give better 
> memory/heap behavior (by a lot), but you'll lose data on system restarts.
> 
> Thanks
> Joe


Re: How does Nifi ingest large files?

2016-10-27 Thread Joe Witt
I should add that if you're comfortable with that sort of volatile behavior,
a better path to consider is to set up a RAM disk and just run a persistent
content repository on that. It will survive process restarts and give better
memory/heap behavior (by a lot), but you'll lose data on system restarts.

Thanks
Joe

On Thu, Oct 27, 2016 at 11:58 AM, Joe Witt  wrote:

> That is correct.
>
> Thanks
> Joe


Re: How does Nifi ingest large files?

2016-10-27 Thread Joe Witt
That is correct.

Thanks
Joe

On Thu, Oct 27, 2016 at 11:55 AM, Jeremy Farbota 
wrote:

> Bryan,
>
> If I have the content repo implementation set to
> org.apache.nifi.controller.repository.VolatileContentRepository, it will
> stream the content in memory, correct?


Re: How does Nifi ingest large files?

2016-10-27 Thread Jeremy Farbota
Bryan,

If I have the content repo implementation set to
org.apache.nifi.controller.repository.VolatileContentRepository,
it will stream the content in memory, correct?
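
For context, this is the nifi.properties setting being discussed. The
implementation property name is the standard one from that file; the commented
sizing properties are the volatile repository's tuning knobs as documented in
the admin guide (verify the names against your NiFi version):

```
# nifi.properties -- swap the default file-system implementation for the volatile one
nifi.content.repository.implementation=org.apache.nifi.controller.repository.VolatileContentRepository
# Optional sizing knobs for the volatile repository:
# nifi.volatile.content.repository.max.size=100 MB
# nifi.volatile.content.repository.block.size=32 KB
```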

On Thu, Oct 27, 2016 at 6:22 AM, Bryan Bende  wrote:

> Monica,
>
> Are you asking what does NiFi do when it picks up a large file from the
> filesystem using a processor like GetFile?
>
> If so, it will stream the content of that file into NiFi's content
> repository, and create a FlowFile pointing to that content. As far as NiFi
> is concerned the content is just bytes at this point and has not been
> changed in anyway from the original file.
>
> The content is not held in memory, and the FlowFile can move through many
> processors without ever accessing the content, unless the processor needs
> to, and then when accessing the content it is typically done in a streaming
> fashion (when possible) to avoid loading the large content into memory.
>
> There are processors that can then split up the content based on specific
> data formats, for example SplitText, SplitJSON, SplitAvro, etc.. but it is
> up to the designer of the flow to do that.
>
> -Bryan



Re: How does Nifi ingest large files?

2016-10-27 Thread Monica Franceschini

I will check,

thank you!

*Monica Franceschini*
Solution Architecture Manager

*Big Data Competence Center, Engineering Group*
Corso Stati Uniti 23/C, 35127 Padova, Italia
Tel: +39 049.8283547
Fax: +39 049.8692566
Twitter: @twittmonique
www.spagobi.org - www.eng.it
*proud SpagoBI supporter and contributor*

Respect the environment. Please don't print this e-mail unless you really need to.

The information transmitted is intended only for the person or entity to
which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipient is prohibited. If you received
this in error, please contact the sender and delete the material from any
computer.


On 27/10/2016 15:55, Bryan Bende wrote:
In the case of a GetFile processor it is managed by a single node 
since the file being picked up is on the local filesystem of one of 
the nodes.


There are other approaches to parallelize work... If you had a shared 
network location you can use ListFile + FetchFile in a certain way so 
that one node does the listing, and then all nodes do fetching.
The same can be done for ListHDFS + FetchHDFS, and some other List + 
Fetch processors.


This post talks about some of this: 
https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html


-Bryan




Re: How does Nifi ingest large files?

2016-10-27 Thread Bryan Bende
In the case of a GetFile processor it is managed by a single node since the
file being picked up is on the local filesystem of one of the nodes.

There are other approaches to parallelize work. If you have a shared
network location, you can use ListFile + FetchFile configured so that one
node does the listing and then all nodes do the fetching.
The same can be done with ListHDFS + FetchHDFS, and some other List + Fetch
processors.

This post talks about some of this:
https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html

-Bryan


On Thu, Oct 27, 2016 at 9:40 AM, Monica Franceschini <
monica.francesch...@eng.it> wrote:

> Thank you Bryan,
>
> yes that's what I meant and it makes sense to me. Only a further question:
> is this stream parallelized if needed on the (hypothetical) Nifi cluster
> or it is managed by a single node?
>
> Cheers
>
> Monica
>
>


Re: How does Nifi ingest large files?

2016-10-27 Thread Monica Franceschini

Thank you Bryan,

yes, that's what I meant and it makes sense to me. Only a further
question: is this stream parallelized, if needed, on a (hypothetical)
NiFi cluster, or is it managed by a single node?


Cheers

Monica




Re: How does Nifi ingest large files?

2016-10-27 Thread Bryan Bende
Monica,

Are you asking what NiFi does when it picks up a large file from the
filesystem using a processor like GetFile?

If so, it will stream the content of that file into NiFi's content
repository and create a FlowFile pointing to that content. As far as NiFi
is concerned, the content is just bytes at this point and has not been
changed in any way from the original file.

The content is not held in memory, and the FlowFile can move through many
processors without ever accessing the content, unless the processor needs
to, and then when accessing the content it is typically done in a streaming
fashion (when possible) to avoid loading the large content into memory.

There are processors that can then split up the content based on specific
data formats (for example SplitText, SplitJson, SplitAvro), but it is up
to the designer of the flow to do that.

-Bryan
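
The streaming behavior described above (content moving in fixed-size chunks, so
memory use stays flat regardless of file size) can be illustrated with a plain
Java sketch. This is not NiFi code; NiFi handles this internally in its
repositories and session API. It only demonstrates the chunked-copy idea:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class StreamCopy {

    // Copy a stream in fixed-size chunks: memory use is bounded by the
    // buffer size no matter how large the input is. This is the general
    // pattern behind streaming content into a content repository.
    static long streamCopy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Create a 1 MiB temp file and copy it without ever holding
        // the whole content in memory at once.
        Path src = Files.createTempFile("large", ".dat");
        Files.write(src, new byte[1024 * 1024]);
        Path dst = Files.createTempFile("copy", ".dat");
        try (InputStream in = Files.newInputStream(src);
             OutputStream out = Files.newOutputStream(dst)) {
            System.out.println(streamCopy(in, out)); // prints 1048576
        }
        Files.delete(src);
        Files.delete(dst);
    }
}
```

Processor code in NiFi typically gets the same effect through callback-based
session reads and writes, so it rarely touches more than one buffer of content
at a time.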


On Thu, Oct 27, 2016 at 4:52 AM, Monica Franceschini <
monica.francesch...@eng.it> wrote:

> Hi,
> I'm figuring out how NiFi ingests large files: does it split them into
> chunks, or is it a massive load? Can you please explain the behavior?
> Kind regards,
> Monica