This is an automated email from the ASF dual-hosted git repository.

exceptionfactory pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/nifi.git
The following commit(s) were added to refs/heads/main by this push:
     new 4d3fcb6843 NIFI-11000 Add compression example to CreateHadoopSequenceFile documentation
4d3fcb6843 is described below

commit 4d3fcb684395ca1be0bef74e96f73dcdfc105fad
Author: Peter Gyori <peter.gyori....@gmail.com>
AuthorDate: Wed Dec 21 17:55:34 2022 +0100

    NIFI-11000 Add compression example to CreateHadoopSequenceFile documentation

    This closes #6801

    Signed-off-by: David Handermann <exceptionfact...@apache.org>
---
 .../additionalDetails.html | 82 +++++++++++++++++-----
 1 file changed, 64 insertions(+), 18 deletions(-)

diff --git a/nifi-nar-bundles/nifi-hadoop-bundle/nifi-hdfs-processors/src/main/resources/docs/org.apache.nifi.processors.hadoop.CreateHadoopSequenceFile/additionalDetails.html b/nifi-nar-bundles/nifi-hadoop-bundle/nifi-hdfs-processors/src/main/resources/docs/org.apache.nifi.processors.hadoop.CreateHadoopSequenceFile/additionalDetails.html
index b8bf7c2d99..9f754a0724 100644
--- a/nifi-nar-bundles/nifi-hadoop-bundle/nifi-hdfs-processors/src/main/resources/docs/org.apache.nifi.processors.hadoop.CreateHadoopSequenceFile/additionalDetails.html
+++ b/nifi-nar-bundles/nifi-hadoop-bundle/nifi-hdfs-processors/src/main/resources/docs/org.apache.nifi.processors.hadoop.CreateHadoopSequenceFile/additionalDetails.html
@@ -23,24 +23,70 @@
 <body>
 <!-- Processor Documentation ================================================== -->

-    <h2>Description:</h2>
-    <p>This processor is used to create a Hadoop Sequence File, which essentially is a file of key/value pairs. The key
-        will be a file name and the value will be the flow file content. The processor will take either a merged (a.k.a. packaged) flow
-        file or a singular flow file. Historically, this processor handled the merging by type and size or time prior to creating a
+    <h2>Description</h2>
+    <p>
+        This processor is used to create a Hadoop Sequence File, which essentially is a file of key/value pairs. The key
+        will be a file name and the value will be the flow file content. The processor will take either a merged (a.k.a. packaged) flow
+        file or a singular flow file. Historically, this processor handled the merging by type and size or time prior to creating a
         SequenceFile output; it no longer does this. If creating a SequenceFile that contains multiple files of the same type is desired,
         precede this processor with a <code>RouteOnAttribute</code> processor to segregate files of the same type and follow that with a
-        <code>MergeContent</code> processor to bundle up files. If the type of files is not important, just use the
-        <code>MergeContent</code> processor. When using the <code>MergeContent</code> processor, the following Merge Formats are
+        <code>MergeContent</code> processor to bundle up files. If the type of files is not important, just use the
+        <code>MergeContent</code> processor. When using the <code>MergeContent</code> processor, the following Merge Formats are
         supported by this processor:
-    <ul>
-        <li>TAR</li>
-        <li>ZIP</li>
-        <li>FlowFileStream v3</li>
-    </ul>
-    The created SequenceFile is named the same as the incoming FlowFile with the suffix '.sf'. For incoming FlowFiles that are
-    bundled, the keys in the SequenceFile are the individual file names, the values are the contents of each file.
-    </p>
-    NOTE: The value portion of a key/value pair is loaded into memory. While there is a max size limit of 2GB, this could cause memory
-    issues if there are too many concurrent tasks and the flow file sizes are large.
-</body>
-</html>
+        <ul>
+            <li>TAR</li>
+            <li>ZIP</li>
+            <li>FlowFileStream v3</li>
+        </ul>
+        The created SequenceFile is named the same as the incoming FlowFile with the suffix '.sf'. For incoming FlowFiles that are
+        bundled, the keys in the SequenceFile are the individual file names, the values are the contents of each file.
+    </p>
+    <p>
+        NOTE: The value portion of a key/value pair is loaded into memory. While there is a max size limit of 2GB, this could cause memory
+        issues if there are too many concurrent tasks and the flow file sizes are large.
+    </p>
+
+    <h2>Using Compression</h2>
+    <p>
+        The value of the <code>Compression codec</code> property determines the compression library the processor uses to compress content.
+        Third party libraries are used for compression. These third party libraries can be Java libraries or native libraries.
+        In case of native libraries, the path of the parent folder needs to be in an environment variable called <code>LD_LIBRARY_PATH</code> so that NiFi can find the libraries.
+    </p>
+    <h3>Example: using Snappy compression with native library on CentOS</h3>
+    <p>
+        <ol>
+            <li>
+                Snappy compression needs to be installed on the server running NiFi:
+                <br/>
+                <code>sudo yum install snappy</code>
+                <br/>
+            </li>
+            <li>
+                Suppose that the server running NiFi has the native compression libraries in <code>/opt/lib/hadoop/lib/native</code>.
+                (Native libraries have file extensions like <code>.so</code>, <code>.dll</code>, <code>.lib</code>, etc. depending on the platform.)
+                <br/>
+                We need to make sure that the files can be executed by the NiFi process' user. For this purpose we can make a copy of these files
+                to e.g. <code>/opt/nativelibs</code> and change their owner. If NiFi is executed by <code>nifi</code> user in the <code>nifi</code> group, then:
+                <br/>
+                <code>chown nifi:nifi /opt/nativelibs</code>
+                <br/>
+                <code>chown nifi:nifi /opt/nativelibs/*</code>
+                <br/>
+            </li>
+            <li>
+                The <code>LD_LIBRARY_PATH</code> needs to be set to contain the path to the folder <code>/opt/nativelibs</code>.
+                <br/>
+            </li>
+            <li>
+                NiFi needs to be restarted.
+            </li>
+            <li>
+                <code>Compression codec</code> property can be set to <code>SNAPPY</code> and a <code>Compression type</code> can be selected.
+            </li>
+            <li>
+                The processor can be started.
+            </li>
+        </ol>
+    </p>
+</body>
+</html>
\ No newline at end of file
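The native-library setup that the added documentation walks through can be sketched as a shell session. This is only an illustrative sketch of the mechanics, not part of the commit: the temporary directories below stand in for the real `/opt/lib/hadoop/lib/native` and `/opt/nativelibs` paths so it runs without root, and `libsnappy.so` is an empty placeholder file; a real deployment would first run `sudo yum install snappy`, copy the actual libraries, and `chown` them to the `nifi` user.

```shell
# Sketch of the Snappy native-library setup steps, assuming the paths
# from the documentation example. Temp dirs replace the real locations
# so this runs unprivileged; in production you would also run
# 'sudo yum install snappy' and 'chown nifi:nifi' on the copies.
NATIVE_SRC=$(mktemp -d)   # stands in for /opt/lib/hadoop/lib/native
NATIVE_DIR=$(mktemp -d)   # stands in for /opt/nativelibs
touch "$NATIVE_SRC/libsnappy.so"   # placeholder for the real shared object

# Copy the libraries to a folder the NiFi process user can read.
cp "$NATIVE_SRC"/*.so "$NATIVE_DIR"/

# Prepend the folder to LD_LIBRARY_PATH so the JVM's native loader finds it.
export LD_LIBRARY_PATH="$NATIVE_DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

echo "$LD_LIBRARY_PATH"
```

After this, NiFi has to be restarted so its JVM inherits the variable, and then `Compression codec` can be set to `SNAPPY` as the documentation describes.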