[ https://issues.apache.org/jira/browse/NIFI-6496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896255#comment-16896255 ]
Edward Armes edited comment on NIFI-6496 at 7/30/19 4:20 PM:
-------------------------------------------------------------

[~otto]: The root of this is a user-list mail, which I've linked here: [http://mail-archives.apache.org/mod_mbox/nifi-users/201907.mbox/%3cCAAPh5FmUoKnadoq+8r2nb=16CjZ3wt=5kozcjukpx8woei2...@mail.gmail.com%3e]

Essentially the TL;DR was that, due to (what I assume to be) a resource restriction, [~malthe] wanted the FlowFile content kept compressed, as their content was easily compressible. The argument about the lack of resources is, I think, not relevant here. However, while I think what is proposed is fine for a custom processor, I don't think it is suitable for the standard processor library (for the reasons I outlined above).

----

Now, as I understand it, there are in general two types of processors in NiFi: ones that consume FlowFiles (Consumers) and ones that don't (Producers). The internals of NiFi are easy to understand and explain in words, but hard to navigate in code. As I said, FlowFile content is *never* kept in memory unless it is being used by a processor; it is instead kept in the content repo (which may be an in-memory repo). An actual FlowFile (the object that is passed from processor to processor) is kept in the FlowFile repo and in memory until it is deemed inactive (for whatever reason), after which it exists only in the FlowFile repo (again, this could be an in-memory-only repo as well).

Now, in correction to what I posted originally, the content of a FlowFile is always exposed to a processor via an InputStream, so content is always read as a stream and the entire content of a FlowFile need not be in memory unless the processor author wishes it. However, I think the issue the original question to the mailing list is trying to skate around is that the content and provenance repos can grow quickly when the contents of a FlowFile are modified.
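To illustrate the streaming point: a NiFi processor typically receives content through a callback such as {{ProcessSession#read(FlowFile, InputStreamCallback)}}, so it only ever sees an InputStream. The bounded-memory property can be sketched with plain JDK streams (no NiFi classes; the helper name here is mine, not NiFi API):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamingRead {
    // Count the bytes of a stream through a fixed 8 KB buffer.
    // Memory use stays constant regardless of how large the content is,
    // which is why streamed FlowFile content need not fit in the heap.
    static long streamLength(InputStream in) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // A 1 MB payload is consumed with only the 8 KB buffer resident.
        InputStream in = new ByteArrayInputStream(new byte[1_000_000]);
        System.out.println(streamLength(in)); // prints 1000000
    }
}
```

A processor that buffers the whole stream (e.g. into a byte array) gives up this property by choice, which matches the point above that content is only fully in memory if the author wishes it.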
Now, as [~malthe] has said, one option would be some sort of shunt that would allow decompression on demand once the content has been run through the CompressContent processor. Quite how this would work I don't know; I suspect it would involve quite a bit of playing with the internals, however. The other approach I could see is to modify the default content repo implementation, the loading of content from the repo, or both, to enable compression and decompression of the content when it's loaded under certain circumstances. There would of course be trade-offs, and there is also the "per-flow settings" question, so this needs a more general discussion.

> Add compression support to record reader processor
> --------------------------------------------------
>
>                 Key: NIFI-6496
>                 URL: https://issues.apache.org/jira/browse/NIFI-6496
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Malthe Borch
>            Priority: Minor
>              Labels: easyfix, usability
>
> Text-based record formats such as CSV, JSON and XML compress well and will
> often be transmitted in a compressed format.
> If compression support is added to the relevant processors, users will not
> need to explicitly unpack files before processing (which may not be feasible
> or practical due to space requirements).
>
> There are at least two ways of implementing this: using a generic approach
> where a {{CompressedRecordReaderFactory}} is the basis for a new controller
> service that wraps the underlying record reader controller service (e.g.
> {{CSVReader}}); or adding the functionality to the relevant record reader
> implementations. The latter option may provide a better UX because no
> additional {{ControllerService}} has to be configured.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
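The wrapping idea from the description above could be sketched with plain JDK streams: before handing the stream to the underlying record reader, peek at the first two bytes and, if they match the gzip magic number, interpose a GZIPInputStream. Everything here (the helper name, the auto-detection behaviour) is an illustrative assumption, not NiFi API:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedReaderShim {
    // Hypothetical helper: wrap the stream in a GZIPInputStream only when
    // the gzip magic bytes (0x1f 0x8b) are present, so uncompressed content
    // still flows through untouched.
    static InputStream maybeDecompress(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 2);
        int b1 = in.read();
        int b2 = in.read();
        if (b2 != -1) in.unread(b2);  // push both bytes back so the
        if (b1 != -1) in.unread(b1);  // downstream reader sees a full stream
        if (b1 == 0x1f && b2 == 0x8b) {
            return new GZIPInputStream(in);
        }
        return in;
    }

    public static void main(String[] args) throws IOException {
        // Compress a small CSV payload, then read it back transparently.
        byte[] csv = "id,name\n1,alice\n".getBytes("UTF-8");
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(csv);
        }
        InputStream decoded = maybeDecompress(new ByteArrayInputStream(bos.toByteArray()));
        System.out.println(new String(decoded.readAllBytes(), "UTF-8"));
        // Uncompressed input passes through unchanged as well.
        InputStream plain = maybeDecompress(new ByteArrayInputStream(csv));
        System.out.println(new String(plain.readAllBytes(), "UTF-8"));
    }
}
```

A real {{CompressedRecordReaderFactory}} would delegate to the wrapped reader factory after this step; the decompression stays streaming, so it composes with the bounded-memory behaviour discussed in the comment.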