[ 
https://issues.apache.org/jira/browse/NIFI-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tamas Palfy updated NIFI-9436:
------------------------------
    Description: 
h2. Background

In *AbstractPutHDFSRecord* (from which *PutParquet* and *PutORC* are derived) an 
*org.apache.parquet.hadoop.ParquetWriter* is {_}created{_}, used to write 
records to an HDFS location, and later _closed_ explicitly.

The writer creation process involves the instantiation of an 
*org.apache.hadoop.fs.FileSystem* object which, when writing to ADLS, is going 
to be an {*}org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem{*}.
Note that the NiFi AbstractPutHDFSRecord processor has already created a 
FileSystem object of the same type for its own purposes, but the writer creates 
its own.
The two could be the same instance if FileSystem caching were enabled, but 
caching is explicitly disabled in *AbstractHadoopProcessor* (the parent of 
_AbstractPutHDFSRecord_).
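Hadoop's FileSystem cache is switched off per URI scheme via a configuration property of the form {{fs.<scheme>.impl.disable.cache}}; for ADLS Gen2 the schemes are abfs/abfss. A minimal sketch of how such a key is derived (the property-name pattern is Hadoop's; the helper class here is illustrative, not NiFi code):

```java
// Illustrative helper: builds the per-scheme cache-disable key that
// AbstractHadoopProcessor effectively sets to true, so every
// FileSystem.get()/newInstance() call yields a distinct instance.
class FsCacheKey {
    static String disableCacheKey(String scheme) {
        // Hadoop's pattern: fs.<scheme>.impl.disable.cache
        return String.format("fs.%s.impl.disable.cache", scheme);
    }
}
```

With caching disabled this way, the processor's FileSystem and the writer's FileSystem are guaranteed to be separate objects.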

The writer only uses the FileSystem object to create an 
*org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream* object and does not 
keep a reference to the FileSystem object itself.
This makes the FileSystem object eligible for garbage collection.

The AbfsOutputStream writes data asynchronously: it submits each write task to 
an ExecutorService and stores the resulting Future in a collection.
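The asynchronous-write pattern described above can be sketched as follows (a simplified illustration of the idea, not the actual AbfsOutputStream code; all names are made up):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: writes are handed to an ExecutorService and the resulting
// Futures are collected so close() can wait for every pending upload.
class AsyncBufferedStream implements AutoCloseable {
    private final ExecutorService executor;                 // shared with the FileSystem
    private final List<Future<Void>> pendingWrites = new ArrayList<>();
    final AtomicInteger completedWrites = new AtomicInteger();

    AsyncBufferedStream(ExecutorService executor) {
        this.executor = executor;
    }

    void write(byte[] buffer) {
        // The upload runs asynchronously; only the Future is kept.
        pendingWrites.add(executor.submit(() -> {
            // ... upload 'buffer' to the remote store ...
            completedWrites.incrementAndGet();
            return null;
        }));
    }

    @Override
    public void close() throws Exception {
        for (Future<Void> f : pendingWrites) {
            f.get(); // blocks until the pending upload finishes
        }
    }
}
```

The important point for this issue is that the stream depends on the executor staying alive until every stored Future completes.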
h2. The issue
 * The _AzureBlobFileSystem_ and the _AbfsOutputStream_ both hold a reference 
to the same _ThreadPoolExecutor_ object.
 * _AzureBlobFileSystem_ (probably depending on the version) overrides the 
_finalize()_ method and closes itself when it is called. This involves shutting 
down the referenced {_}ThreadPoolExecutor{_}.
 * It is possible for garbage collection to occur after the _ParquetWriter_ is 
created but before it is explicitly closed: GC -> 
_AzureBlobFileSystem.finalize()_ -> {_}ThreadPoolExecutor.shutdown(){_}.
 * When the _ParquetWriter_ is explicitly closed, it tries to run a cleanup job 
using the {_}ThreadPoolExecutor{_}. That job submission fails because the 
_ThreadPoolExecutor_ has already terminated, but a _Future_ object is still 
created - and is waited on indefinitely.

This causes the processor to hang.
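The hang can be reproduced in isolation with a small sketch (illustrative only: the real rejection path inside hadoop-azure differs in detail, and here a DiscardPolicy stands in for whatever lets the Future outlive the failed submission):

```java
import java.util.concurrent.FutureTask;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HangDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for the executor shared by AzureBlobFileSystem and
        // AbfsOutputStream (structure is illustrative, not the real code).
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(),
                new ThreadPoolExecutor.DiscardPolicy()); // rejected tasks vanish silently

        // Simulates GC -> AzureBlobFileSystem.finalize() -> ThreadPoolExecutor.shutdown()
        pool.shutdown();

        // Simulates the cleanup job submitted during ParquetWriter.close():
        // the Future object exists, but the terminated pool never runs it.
        FutureTask<Void> cleanup = new FutureTask<>(() -> null);
        pool.execute(cleanup); // silently discarded by the terminated pool

        try {
            cleanup.get(1, TimeUnit.SECONDS); // the real wait has no timeout
        } catch (TimeoutException e) {
            System.out.println("cleanup future never completes");
        }
    }
}
```

In NiFi the equivalent wait has no timeout, so the processor thread blocks forever.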
h2. The solution

This feels like an issue that should be addressed in the _hadoop-azure_ 
library, but it is possible to apply a workaround in NiFi.

The problem starts with the _AzureBlobFileSystem_ getting garbage collected. If 
the _ParquetWriter_ used the same _FileSystem_ object that the processor 
already created for itself (and kept a reference to), that live reference would 
prevent the garbage collection from occurring.
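The workaround idea reduces to keeping the shared FileSystem strongly reachable for the writer's whole lifetime. A hypothetical sketch (class and method names are made up; {{Reference.reachabilityFence}} is an optional belt-and-braces on top of simply holding the reference):

```java
import java.lang.ref.Reference;

// Hypothetical sketch: as long as a strong reference to the FileSystem is
// held until the writer is closed, its finalize() cannot run and the shared
// thread pool stays alive.
class WriterScope {
    static void writeAndClose(Object sharedFileSystem, Runnable writeAndCloseWriter) {
        try {
            writeAndCloseWriter.run();  // create writer, write records, close it
        } finally {
            // Keeps 'sharedFileSystem' reachable until this point, even if the
            // JIT would otherwise consider it dead earlier in the method.
            Reference.reachabilityFence(sharedFileSystem);
        }
    }
}
```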

> PutParquet, PutORC processors may hang after writing to ADLS
> ------------------------------------------------------------
>
>                 Key: NIFI-9436
>                 URL: https://issues.apache.org/jira/browse/NIFI-9436
>             Project: Apache NiFi
>          Issue Type: Bug
>            Reporter: Tamas Palfy
>            Assignee: Tamas Palfy
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
