[
https://issues.apache.org/jira/browse/NIFI-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tamas Palfy updated NIFI-9436:
------------------------------
Description:
h2. Background
In *AbstractPutHDFSRecord* (from which *PutParquet* and *PutORC* are derived) an
*org.apache.parquet.hadoop.ParquetWriter* writer is _created_, used to write
records to an HDFS location, and is later _closed_ explicitly.
The writer creation process involves the instantiation of an
*org.apache.hadoop.fs.FileSystem* object, which, when writing to ADLS is going
to be an *org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem*.
Note that the NiFi AbstractPutHDFSRecord processor already creates a FileSystem
object of the same type for its own purposes, but the writer creates its own.
The writer only uses the FileSystem object to create an
*org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream* object and does not
keep a reference to the FileSystem object itself.
This makes the FileSystem object eligible for garbage collection.
The AbfsOutputStream writes data asynchronously: it submits write tasks to an
ExecutorService and stores the resulting Future objects in a collection.
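The ownership problem described above can be sketched with a minimal model (the class names below are hypothetical stand-ins, not the actual NiFi or hadoop-azure code): the writer asks a freshly created file system for an output stream, keeps only the stream, and so leaves the file system without any strong reference.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Stand-in for AzureBlobFileSystem: it owns an executor that the
// streams it creates also use.
class ModelFileSystem {
    final ExecutorService executor = Executors.newFixedThreadPool(2);

    ModelOutputStream create() {
        // The stream receives the executor, not the file system itself.
        return new ModelOutputStream(executor);
    }
}

// Stand-in for AbfsOutputStream: holds the executor but not its owner.
class ModelOutputStream {
    final ExecutorService executor;
    ModelOutputStream(ExecutorService executor) { this.executor = executor; }
}

public class WriterCreation {
    static ModelOutputStream openStream() {
        // The local variable is the only strong reference to the file system;
        // once this method returns, the ModelFileSystem is GC-eligible even
        // though the returned stream still depends on its executor.
        ModelFileSystem fs = new ModelFileSystem();
        return fs.create();
    }

    public static void main(String[] args) {
        ModelOutputStream out = openStream();
        System.out.println("stream keeps executor: " + (out.executor != null));
        out.executor.shutdown();
    }
}
```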
h2. The issue
* The _AzureBlobFileSystem_ and the _AbfsOutputStream_ both hold a reference to
the same _ThreadPoolExecutor_ object.
* _AzureBlobFileSystem_ (probably depending on the version) overrides the
_finalize()_ method and closes itself when that is called. This involves
shutting down the referenced _ThreadPoolExecutor_.
* It is possible for garbage collection to occur after the _ParquetWriter_ is
created but before it is explicitly closed: GC ->
_AzureBlobFileSystem.finalize()_ -> _ThreadPoolExecutor.shutdown()_.
* When the _ParquetWriter_ is explicitly closed, it tries to run a cleanup job
using the _ThreadPoolExecutor_. That job submission fails because the
_ThreadPoolExecutor_ is already terminated, but a _Future_ object is still
created - and is waited on indefinitely.
This causes the processor to hang.
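The failure sequence above can be reproduced in miniature with a plain ThreadPoolExecutor (this is a hedged sketch of the mechanism, not the actual hadoop-azure close path): once the shared pool is shut down, a later submission is rejected, yet a Future-like object created around the cleanup work never completes, so waiting on it would block forever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class FinalizeRace {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // Step 1: GC finalizes the unreachable file system, which shuts down
        // the shared pool (modeled here by calling shutdown() directly).
        pool.shutdown();

        // Step 2: closing the writer later submits a cleanup task.
        // Submission to a terminated pool is rejected...
        CompletableFuture<Void> cleanup = new CompletableFuture<>();
        boolean rejected = false;
        try {
            pool.submit(() -> cleanup.complete(null));
        } catch (RejectedExecutionException e) {
            rejected = true;
        }

        // ...so no task will ever complete the future; in the buggy path the
        // caller still blocks on it forever. We only print the state here
        // instead of actually calling cleanup.get().
        System.out.println("rejected=" + rejected + " done=" + cleanup.isDone());
    }
}
```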
h2. The solution
This feels like an issue that should be addressed in the _hadoop-azure_
library, but it is possible to apply a workaround in NiFi.
The problem starts with the _AzureBlobFileSystem_ being garbage collected. So
if the _ParquetWriter_ used the same _FileSystem_ object that the processor
already created for itself - and keeps a reference to - that reference would
prevent the garbage collection from occurring.
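The proposed workaround amounts to an ownership change, sketched below with hypothetical stand-in classes (not the actual NiFi processor code): the processor holds the file system in a field for its whole lifetime and hands that same instance to the writer, so the file system can never become unreachable - and its finalizer can never shut the shared executor down - while a writer is still open.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ReuseFileSystem {
    // Stand-in for org.apache.hadoop.fs.FileSystem (hypothetical class).
    static class SharedFs {
        final ExecutorService executor = Executors.newFixedThreadPool(2);
    }

    // The processor keeps its FileSystem in a field for its whole lifetime...
    final SharedFs processorFs = new SharedFs();

    // ...and hands that same instance to the writer instead of letting the
    // writer instantiate its own. While the processor is reachable, so is
    // the file system, so finalize() cannot run prematurely.
    SharedFs fileSystemForWriter() {
        return processorFs;
    }

    public static void main(String[] args) {
        ReuseFileSystem processor = new ReuseFileSystem();
        SharedFs writerFs = processor.fileSystemForWriter();
        System.out.println("writer reuses processor's FileSystem: "
                + (writerFs == processor.processorFs));
        writerFs.executor.shutdown();
    }
}
```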
> PutParquet, PutORC processors may hang after writing to ADLS
> ------------------------------------------------------------
>
> Key: NIFI-9436
> URL: https://issues.apache.org/jira/browse/NIFI-9436
> Project: Apache NiFi
> Issue Type: Bug
> Reporter: Tamas Palfy
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)