[ https://issues.apache.org/jira/browse/FLINK-34696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17827883#comment-17827883 ]

Simon-Shlomo Poil edited comment on FLINK-34696 at 3/18/24 11:41 AM:
---------------------------------------------------------------------

Dear Galen,

Thank you for your detailed feedback. I have several suggestions and questions 
regarding the current implementation:

{*}Efficiency in Blob Composition{*}: Instead of creating temporary composite 
blobs, could we append all blobs directly to the final blob? That would limit 
data duplication to roughly doubling the amount of data. Currently, the process 
creates multiple temporary blobs that in the final steps are nearly as large as 
the final blob itself, leading to an unnecessary increase in storage, 
especially when dealing with TiB-sized blobs. In our case, for example, we saw 
a more than 4000-fold increase in the amount of data: about 4.5 million small 
blobs are created, and in the end multiple TiB-sized temporary composite blobs, 
so although our data is on the order of TiB we need PiB of storage.

 
{code:java}
I.e. instead of
Step 1: 32 blobs = A
Step 2: A + 31 blobs = B
Step 3: B + 31 blobs = C

the code would do
Step 1: select the first blob seen as the "temporary final" = A
        A + 31 blobs = A
Step 2: A + 31 blobs = A
etc.
{code}
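
To make the suggestion concrete, here is a rough sketch of the append-in-place 
approach using the GCS Java client directly. The class, method and blob names 
are made up for illustration; the real change would of course live in 
GSRecoverableWriterCommitter and go through Flink's blob storage abstraction, 
and the sketch assumes the final blob already exists (e.g. seeded from the 
first temporary blob):

{code:java}
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

import java.util.List;

public class ComposeIntoFinalBlob {

    // GCS compose accepts at most 32 sources per call; with the growing final
    // blob itself as one source, we can append up to 31 new blobs per call.
    private static final int MAX_APPEND_PER_COMPOSE = 31;

    /** Appends the temporary blobs onto the final blob in place, deleting each batch once composed. */
    public static void appendToFinalBlob(
            Storage storage, String bucket, String finalBlobName, List<String> tempBlobNames) {

        for (int i = 0; i < tempBlobNames.size(); i += MAX_APPEND_PER_COMPOSE) {
            List<String> batch =
                    tempBlobNames.subList(
                            i, Math.min(i + MAX_APPEND_PER_COMPOSE, tempBlobNames.size()));

            Storage.ComposeRequest.Builder request =
                    Storage.ComposeRequest.newBuilder()
                            .setTarget(BlobInfo.newBuilder(bucket, finalBlobName).build())
                            .addSource(finalBlobName); // the current "temporary final"
            batch.forEach(request::addSource);

            storage.compose(request.build());

            // Delete the sources only after the compose succeeded, so a failure
            // here never loses data; at most one extra batch is duplicated.
            batch.forEach(name -> storage.delete(BlobId.of(bucket, name)));
        }
    }

    public static void main(String[] args) {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        appendToFinalBlob(
                storage, "my-bucket", "final-blob", List.of("part-0001", "part-0002", "part-0003"));
    }
}
{code}

With something like this, the extra storage at any point in time is bounded by 
the final blob plus one batch of temporary blobs, instead of a second full copy 
of the data.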
 

{*}Optimization of the Recovery Mechanism{*}: Is it possible to optimize the 
recovery mechanism to avoid data duplication? Ideally, once the small blobs 
have been merged into the final blob, we could unregister (and delete) them 
from the recovery state to save space.

{*}Handling the 5 TB Blob Limit{*}: The current code doesn't seem to account 
for the 5 TB blob limit. A simple solution could be to initiate an additional 
final blob when nearing this limit.
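
A very small sketch of that rollover bookkeeping (the limit constant, the class 
and method names, and the naming scheme for extra final blobs are all 
assumptions for illustration; the caller would reset the tracked size after a 
rollover):

{code:java}
// Hypothetical rollover helper: start a new final blob before the
// single-object size limit would be exceeded by the next append.
public class FinalBlobRollover {

    private static final long MAX_OBJECT_SIZE_BYTES = 5L * 1024L * 1024L * 1024L * 1024L;

    private int finalBlobIndex = 0;

    /** Returns the name of the final blob the next batch should be appended to. */
    public String finalBlobFor(long currentFinalSizeBytes, long nextBatchSizeBytes) {
        if (currentFinalSizeBytes + nextBatchSizeBytes > MAX_OBJECT_SIZE_BYTES) {
            finalBlobIndex++; // roll over to a fresh final blob
        }
        return String.format("final-part-%04d", finalBlobIndex);
    }
}
{code}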

{*}Rationale Behind Blob Composition{*}: What is the primary reason for 
composing multiple small blobs into a single large blob? Is it to optimize the 
reading process later, or is there another benefit? If the goal is to end up 
with a large final blob, might it be more efficient to append data directly to 
it instead of creating multiple smaller blobs first?

{*}TTL and Temporary Bucket{*}: I like the idea of using a time-to-live setting 
on a temporary bucket to ensure that temporary blobs are removed, e.g. if the 
job fails. However, I would not have an easy way to determine a good TTL up 
front; that seems difficult to estimate unless you have already run the job and 
gathered statistics on how long it takes. It should also be documented that, 
for example, the new soft-delete feature of GCS means temporary objects will by 
default be kept (and paid for) for 7 days after deletion unless you switch that 
feature off, and that if versioning is enabled, appending to a blob will 
generate multiple versions (all of which you pay for).
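
For reference, setting such a TTL on the temporary bucket with the GCS Java 
client could look roughly like this (the 3-day age and the bucket name are 
placeholders; as said, picking a good value is the hard part):

{code:java}
import com.google.cloud.storage.Bucket;
import com.google.cloud.storage.BucketInfo.LifecycleRule;
import com.google.cloud.storage.BucketInfo.LifecycleRule.LifecycleAction;
import com.google.cloud.storage.BucketInfo.LifecycleRule.LifecycleCondition;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

import java.util.List;

public class TempBucketTtl {
    public static void main(String[] args) {
        Storage storage = StorageOptions.getDefaultInstance().getService();

        // Delete every object in the temporary bucket once it is older than 3 days.
        LifecycleRule ttlRule =
                new LifecycleRule(
                        LifecycleAction.newDeleteAction(),
                        LifecycleCondition.newBuilder().setAge(3).build());

        Bucket tempBucket = storage.get("my-flink-temporary-bucket"); // placeholder name
        tempBucket.toBuilder().setLifecycleRules(List.of(ttlRule)).build().update();
    }
}
{code}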

 

Looking forward to your thoughts on these points.



> GSRecoverableWriterCommitter is generating excessive data blobs
> ---------------------------------------------------------------
>
>                 Key: FLINK-34696
>                 URL: https://issues.apache.org/jira/browse/FLINK-34696
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem
>            Reporter: Simon-Shlomo Poil
>            Priority: Major
>
> The `composeBlobs` method in 
> `org.apache.flink.fs.gs.writer.GSRecoverableWriterCommitter` is designed to 
> merge multiple small blobs into a single large blob using Google Cloud 
> Storage's compose method. This process is iterative, combining the result 
> from the previous iteration with 31 new blobs until all blobs are merged. 
> Upon completion of the composition, the method proceeds to remove the 
> temporary blobs.
> *Issue:*
> This methodology results in significant, unnecessary data storage consumption 
> during the blob composition process, incurring considerable costs due to 
> Google Cloud Storage pricing models.
> *Example to Illustrate the Problem:*
>  - Initial state: 64 blobs, each 1 GB in size (totaling 64 GB).
>  - After 1st step: 32 blobs are merged into a single blob, increasing total 
> storage to 96 GB (64 original + 32 GB new).
>  - After 2nd step: The newly created 32 GB blob is merged with 31 more blobs, 
> raising the total to 159 GB.
>  - After 3rd step: The final blob is merged, culminating in a total of 223 GB 
> to combine the original 64 GB of data. This results in an overhead of 159 GB.
> *Impact:*
> This inefficiency has a profound impact, especially at scale, where terabytes 
> of data can incur overheads in the petabyte range, leading to unexpectedly 
> high costs. Additionally, we have observed an increase in storage exceptions 
> thrown by the Google Storage library, potentially linked to this issue.
> *Suggested Solution:*
> To mitigate this problem, we propose modifying the `composeBlobs` method to 
> immediately delete source blobs once they have been successfully combined. 
> This change could significantly reduce data duplication and associated costs. 
> However, the implications for data recovery and integrity need careful 
> consideration to ensure that this optimization does not compromise the 
> ability to recover data in case of a failure during the composition process.
> *Steps to Reproduce:*
> 1. Initiate the blob composition process in an environment with a significant 
> number of blobs (e.g., 64 blobs of 1 GB each).
> 2. Observe the temporary increase in data storage as blobs are iteratively 
> combined.
> 3. Note the final amount of data storage used compared to the initial total 
> size of the blobs.
> *Expected Behavior:*
> The blob composition process should minimize unnecessary data storage use, 
> efficiently managing resources to combine blobs without generating excessive 
> temporary data overhead.
> *Actual Behavior:*
> The current implementation results in significant temporary increases in data 
> storage, leading to high costs and potential system instability due to 
> frequent storage exceptions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
