[
https://issues.apache.org/jira/browse/NIFI-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tamas Palfy updated NIFI-9436:
------------------------------
Description:
h2. Background
In *AbstractPutHDFSRecord* (from which *PutParquet* and *PutORC* are derived) an
*org.apache.parquet.hadoop.ParquetWriter* writer is _created_, used to write
records to an HDFS location, and is later _closed_ explicitly.
The writer creation process involves the instantiation of an
*org.apache.hadoop.fs.FileSystem* object, which, when writing to ADLS is going
to be an *org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem*.
Note that the NiFi AbstractPutHDFSRecord processor already creates a FileSystem
object of the same type for its own purposes, but the writer creates its own.
The writer only uses the FileSystem object to create an
*org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream* object and does not
keep a reference to the FileSystem object itself.
This makes the FileSystem object eligible for garbage collection.
The AbfsOutputStream writes data asynchronously: it submits write tasks to an
ExecutorService and stores the resulting Future objects in a collection.
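The ownership problem described above can be sketched with a minimal model (the class names below are hypothetical stand-ins, not the actual NiFi or hadoop-azure code): the writer asks a freshly created file system for an output stream, keeps only the stream, and so leaves the file system without any strong reference.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Stand-in for AzureBlobFileSystem: it owns an executor that the
// streams it creates also use.
class ModelFileSystem {
    final ExecutorService executor = Executors.newFixedThreadPool(2);

    ModelOutputStream create() {
        // The stream receives the executor, not the file system itself.
        return new ModelOutputStream(executor);
    }
}

// Stand-in for AbfsOutputStream: holds the executor but not its owner.
class ModelOutputStream {
    final ExecutorService executor;
    ModelOutputStream(ExecutorService executor) { this.executor = executor; }
}

public class WriterCreation {
    static ModelOutputStream openStream() {
        // The local variable is the only strong reference to the file system;
        // once this method returns, the ModelFileSystem is GC-eligible even
        // though the returned stream still depends on its executor.
        ModelFileSystem fs = new ModelFileSystem();
        return fs.create();
    }

    public static void main(String[] args) {
        ModelOutputStream out = openStream();
        System.out.println("stream keeps executor: " + (out.executor != null));
        out.executor.shutdown();
    }
}
```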
h2. The issue
* The _AzureBlobFileSystem_ and the _AbfsOutputStream_ both hold a reference to
the same _ThreadPoolExecutor_ object.
* _AzureBlobFileSystem_ (probably depending on the version) overrides the
_finalize()_ method and closes itself when that is called. This involves
shutting down the referenced _ThreadPoolExecutor_.
* It is possible for garbage collection to occur after the _ParquetWriter_ is
created but before it is explicitly closed: GC ->
_AzureBlobFileSystem.finalize()_ -> _ThreadPoolExecutor.shutdown()_.
* When the _ParquetWriter_ is explicitly closed, it tries to run a cleanup job
using the _ThreadPoolExecutor_. That job submission fails because the
_ThreadPoolExecutor_ is already terminated, but a _Future_ object is still
created - and is waited on indefinitely.
This causes the processor to hang.
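The failure sequence above can be reproduced in miniature with a plain ThreadPoolExecutor (this is a hedged sketch of the mechanism, not the actual hadoop-azure close path): once the shared pool is shut down, a later submission is rejected, yet a Future-like object created around the cleanup work never completes, so waiting on it would block forever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class FinalizeRace {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // Step 1: GC finalizes the unreachable file system, which shuts down
        // the shared pool (modeled here by calling shutdown() directly).
        pool.shutdown();

        // Step 2: closing the writer later submits a cleanup task.
        // Submission to a terminated pool is rejected...
        CompletableFuture<Void> cleanup = new CompletableFuture<>();
        boolean rejected = false;
        try {
            pool.submit(() -> cleanup.complete(null));
        } catch (RejectedExecutionException e) {
            rejected = true;
        }

        // ...so no task will ever complete the future; in the buggy path the
        // caller still blocks on it forever. We only print the state here
        // instead of actually calling cleanup.get().
        System.out.println("rejected=" + rejected + " done=" + cleanup.isDone());
    }
}
```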
h2. The solution
This feels like an issue that should be addressed in the _hadoop-azure_
library, but it is possible to apply a workaround in NiFi.
The problem starts with the _AzureBlobFileSystem_ being garbage collected. So
if the _ParquetWriter_ used the same _FileSystem_ object that the processor
already created for itself - and keeps a reference to - that reference would
prevent the garbage collection from occurring.
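The proposed workaround amounts to an ownership change, sketched below with hypothetical stand-in classes (not the actual NiFi processor code): the processor holds the file system in a field for its whole lifetime and hands that same instance to the writer, so the file system can never become unreachable - and its finalizer can never shut the shared executor down - while a writer is still open.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ReuseFileSystem {
    // Stand-in for org.apache.hadoop.fs.FileSystem (hypothetical class).
    static class SharedFs {
        final ExecutorService executor = Executors.newFixedThreadPool(2);
    }

    // The processor keeps its FileSystem in a field for its whole lifetime...
    final SharedFs processorFs = new SharedFs();

    // ...and hands that same instance to the writer instead of letting the
    // writer instantiate its own. While the processor is reachable, so is
    // the file system, so finalize() cannot run prematurely.
    SharedFs fileSystemForWriter() {
        return processorFs;
    }

    public static void main(String[] args) {
        ReuseFileSystem processor = new ReuseFileSystem();
        SharedFs writerFs = processor.fileSystemForWriter();
        System.out.println("writer reuses processor's FileSystem: "
                + (writerFs == processor.processorFs));
        writerFs.executor.shutdown();
    }
}
```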
> PutParquet, PutORC processors may hang after writing to ADLS
> ------------------------------------------------------------
>
> Key: NIFI-9436
> URL: https://issues.apache.org/jira/browse/NIFI-9436
> Project: Apache NiFi
> Issue Type: Bug
> Reporter: Tamas Palfy
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)