Re: [I] [SUPPORT] java.lang.OutOfMemoryError: Requested array size exceeds VM limit on data ingestion to COW table [hudi]

2024-05-02 Thread via GitHub


TarunMootala commented on issue #11122:
URL: https://github.com/apache/hudi/issues/11122#issuecomment-2090765146

   @ad1happy2go 
   > Can you share the timeline?
   
   Can you elaborate on this?
   
   > Do you know how many file groups are there in the clean instant?
   
   Are you referring to the number of files in that particular cleaner run?
   





Re: [I] [SUPPORT] java.lang.OutOfMemoryError: Requested array size exceeds VM limit on data ingestion to COW table [hudi]

2024-05-02 Thread via GitHub


ad1happy2go commented on issue #11122:
URL: https://github.com/apache/hudi/issues/11122#issuecomment-2090370136

   @TarunMootala Can you share the timeline? Do you know how many file groups are there in the clean instant?





Re: [I] [SUPPORT] java.lang.OutOfMemoryError: Requested array size exceeds VM limit on data ingestion to COW table [hudi]

2024-05-01 Thread via GitHub


TarunMootala commented on issue #11122:
URL: https://github.com/apache/hudi/issues/11122#issuecomment-2088570932

   @ad1happy2go,
   
   When AWS Glue encounters an OOME, it kills the JVM immediately, which could be why the error is not available in the driver logs. However, the error is present in the output logs, and it is the same as the one given in the overview.
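   
   A minimal sketch of one way to capture more detail when the JVM is killed on OOME, assuming the runtime allows overriding the driver JVM options (the dump path is a placeholder):
   
   ```python
   # Hedged illustration: ask the JVM to write a heap dump when an
   # OutOfMemoryError is thrown, so the failure can be inspected even if
   # the process dies immediately afterwards. The dump path is a placeholder.
   spark_conf = {
       "spark.driver.extraJavaOptions": (
           "-XX:+HeapDumpOnOutOfMemoryError "
           "-XX:HeapDumpPath=/tmp/driver-oome.hprof"
       ),
   }
   ```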





Re: [I] [SUPPORT] java.lang.OutOfMemoryError: Requested array size exceeds VM limit on data ingestion to COW table [hudi]

2024-05-01 Thread via GitHub


ad1happy2go commented on issue #11122:
URL: https://github.com/apache/hudi/issues/11122#issuecomment-2088379340

   @TarunMootala The size itself doesn't look that big. I couldn't locate the error in the log. Can you check once?





Re: [I] [SUPPORT] java.lang.OutOfMemoryError: Requested array size exceeds VM limit on data ingestion to COW table [hudi]

2024-04-30 Thread via GitHub


TarunMootala commented on issue #11122:
URL: https://github.com/apache/hudi/issues/11122#issuecomment-2087065842

   `.hoodie/` folder is 350 MB and contains 3,435 files (this includes the active and archived timelines).
   `.hoodie/archived/` is 327 MB and contains 695 files (archived timeline only).
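   
   For reference, a minimal boto3 sketch of how sizes and file counts like these can be gathered from S3 (the bucket name and table prefix are placeholders, not the actual values):
   
   ```python
   import boto3
   
   # Hedged sketch: sum object sizes and count objects under a timeline prefix.
   # "my-bucket" and the table path are placeholders.
   s3 = boto3.client("s3")
   total_bytes, file_count = 0, 0
   for page in s3.get_paginator("list_objects_v2").paginate(
           Bucket="my-bucket", Prefix="tables/my_table/.hoodie/"):
       for obj in page.get("Contents", []):
           total_bytes += obj["Size"]
           file_count += 1
   print(f"{file_count} files, {total_bytes / 1024 / 1024:.0f} MB")
   ```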
   
   Attached driver logs
   
[log-events-viewer-result.csv](https://github.com/apache/hudi/files/15170629/log-events-viewer-result.csv)
   





Re: [I] [SUPPORT] java.lang.OutOfMemoryError: Requested array size exceeds VM limit on data ingestion to COW table [hudi]

2024-04-30 Thread via GitHub


ad1happy2go commented on issue #11122:
URL: https://github.com/apache/hudi/issues/11122#issuecomment-2085754426

   @TarunMootala Can you check the size of the timeline files? And can you post the driver logs?





Re: [I] [SUPPORT] java.lang.OutOfMemoryError: Requested array size exceeds VM limit on data ingestion to COW table [hudi]

2024-04-30 Thread via GitHub


TarunMootala commented on issue #11122:
URL: https://github.com/apache/hudi/issues/11122#issuecomment-2085549745

   @ad1happy2go 
   Thanks for your inputs. 
   I don't think it was related to loading of the archived timeline. When this error occurred, the first thing I tried was cleaning the archived timeline (`.hoodie/archived/`), and it didn't help. Only deleting (archiving) a few of the oldest Hudi metadata files from the active timeline (the `.hoodie` folder) and reducing `hoodie.keep.max.commits` resolved the issue.
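   
   For illustration, a rough sketch of that manual step, assuming an S3 table path (the bucket, prefix, and number of files to remove are placeholders; instant files sort chronologically by their timestamp prefix):
   
   ```python
   import boto3
   
   # Hedged sketch of the manual workaround described above: delete the N
   # oldest instant files from the active timeline. All names are placeholders.
   s3 = boto3.client("s3")
   bucket, prefix = "my-bucket", "tables/my_table/.hoodie/"
   
   keys = []
   for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
       for obj in page.get("Contents", []):
           name = obj["Key"][len(prefix):]
           if name and "/" not in name and name[0].isdigit():  # instant files only
               keys.append(obj["Key"])
   
   for key in sorted(keys)[:50]:  # N = 50 oldest, a placeholder
       s3.delete_object(Bucket=bucket, Key=key)
   ```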
   





Re: [I] [SUPPORT] java.lang.OutOfMemoryError: Requested array size exceeds VM limit on data ingestion to COW table [hudi]

2024-04-30 Thread via GitHub


ad1happy2go commented on issue #11122:
URL: https://github.com/apache/hudi/issues/11122#issuecomment-2085028466

   @phani482 Is it possible for you to upgrade the Hudi version to 0.14.1 and check if you still see this issue? The other issue was related to loading of the archived timeline during sync, which was fixed in later releases.





Re: [I] [SUPPORT] java.lang.OutOfMemoryError: Requested array size exceeds VM limit on data ingestion to COW table [hudi]

2024-04-29 Thread via GitHub


phani482 commented on issue #11122:
URL: https://github.com/apache/hudi/issues/11122#issuecomment-2083636756

   The same issue was reported in the past and is still open for RCA: https://github.com/apache/hudi/issues/7800
   





[I] [SUPPORT] java.lang.OutOfMemoryError: Requested array size exceeds VM limit on data ingestion to COW table [hudi]

2024-04-29 Thread via GitHub


TarunMootala opened a new issue, #11122:
URL: https://github.com/apache/hudi/issues/11122

   **Describe the problem you faced**
   We have a Spark streaming job that reads data from an input stream and appends it to a COW table partitioned on subject area. The streaming job has a batch interval of 120 seconds.
   
   Intermittently, the job fails with the error:
   
   ```
   java.lang.OutOfMemoryError: Requested array size exceeds VM limit
   ```
   
   We debugged multiple failures; the job always fails at the stage `collect at HoodieSparkEngineContext.java:118 (CleanPlanActionExecutor)`.
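   
   For context, that stage name corresponds to a collect-to-driver pattern: per-partition clean plans are computed on executors and then materialized as a single array on the driver. A simplified, hypothetical PySpark sketch of that shape (all names are illustrative, not Hudi's actual code):
   
   ```python
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder.getOrCreate()
   
   def get_delete_paths(partition):
       # stand-in for the planner's per-partition file listing
       return [f"{partition}/file_{i}.parquet" for i in range(3)]
   
   partitions_to_clean = ["entity=a", "entity=b", "entity=c"]
   plans = (
       spark.sparkContext
       .parallelize(partitions_to_clean, 3)
       .map(lambda p: (p, get_delete_paths(p)))
       .collect()  # driver-side materialization; very large plans can exceed JVM array limits
   )
   ```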
   
   
   **To Reproduce**
   
   No specific steps. 
   
   **Expected behavior**
   
   The job should commit the data successfully and continue with the next micro-batch.
   
   **Environment Description**
   
   * Hudi version : 0.12.1 (Glue 4.0)
   
   * Spark version : Spark 3.3.0
   
   * Hive version : N/A
   
   * Hadoop version : N/A
   
   * Storage (HDFS/S3/GCS..) : S3 
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   We are not sure of the exact fix or root cause. However, the workaround (not ideal) is to manually delete (archive) a few of the oldest Hudi metadata files from the active timeline (the `.hoodie` folder) and reduce `hoodie.keep.max.commits`. This only works when we reduce the max commits, and each time the max commits are reduced the job runs perfectly for a few months before failing again.
   
   Our requirement is to retain 1500 commits to enable incremental queries over the last 2 days of changes. Initially we started with a max commits of 1500 and gradually came down to 400.
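   
   For reference, the arithmetic behind the 1500 figure: at one commit per 120-second micro-batch, 2 days of changes works out to roughly 1440 commits.
   
   ```python
   # Quick sanity check of the retention target: one commit per 120 s batch.
   seconds_per_two_days = 2 * 24 * 60 * 60
   commits_for_two_days = seconds_per_two_days // 120
   print(commits_for_two_days)  # 1440, hence the ~1500 requirement
   ```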
   
   **Hudi Config**
   
   ```
   "hoodie.table.name": "table_name",
   "hoodie.datasource.write.keygenerator.type": "COMPLEX",
   "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
   "hoodie.datasource.write.partitionpath.field": "entity_name",
   "hoodie.datasource.write.recordkey.field": "partition_key,sequence_number",
   "hoodie.datasource.write.precombine.field": "approximate_arrival_timestamp",
   "hoodie.datasource.write.operation": "insert",
   "hoodie.insert.shuffle.parallelism": 10,
   "hoodie.bulkinsert.shuffle.parallelism": 10,
   "hoodie.upsert.shuffle.parallelism": 10,
   "hoodie.delete.shuffle.parallelism": 10,
   "hoodie.metadata.enable": "false",
   "hoodie.datasource.hive_sync.use_jdbc": "false",
   "hoodie.datasource.hive_sync.enable": "false",
   "hoodie.datasource.hive_sync.database": "database_name",
   "hoodie.datasource.hive_sync.table": "table_name",
   "hoodie.datasource.hive_sync.partition_fields": "entity_name",
   "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
   "hoodie.datasource.hive_sync.support_timestamp": "true",
   "hoodie.keep.min.commits": 450,  # to preserve commits for at least 2 days with processingTime="120 seconds"
   "hoodie.keep.max.commits": 480,  # to preserve commits for at least 2 days with processingTime="120 seconds"
   "hoodie.cleaner.commits.retained": 449,
   ```
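   
   For completeness, a minimal sketch of how options like these are typically applied in a Structured Streaming `foreachBatch` write (the source, the paths, and the `hudi_options` dict are placeholders standing in for the config above, not the reporter's actual job):
   
   ```python
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder.getOrCreate()
   input_stream = spark.readStream.format("rate").load()  # placeholder source
   hudi_options = {"hoodie.table.name": "table_name"}  # plus the options above
   
   def write_batch(batch_df, batch_id):
       # hedged usage sketch: append each micro-batch to the COW table
       (batch_df.write.format("hudi")
           .options(**hudi_options)
           .mode("append")
           .save("s3://my-bucket/tables/table_name/"))
   
   query = (
       input_stream.writeStream
       .foreachBatch(write_batch)
       .trigger(processingTime="120 seconds")
       .option("checkpointLocation", "s3://my-bucket/checkpoints/table_name/")
       .start()
   )
   ```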
   
   **Stacktrace**
   
   
   ```
   java.lang.OutOfMemoryError: Requested array size exceeds VM limit
   ```
   
   We debugged multiple failure logs; the job always fails at the stage `collect at HoodieSparkEngineContext.java:118 (CleanPlanActionExecutor)`.
   
   
   

