[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions
[ https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4261:
----------------------------------
    Fix Version/s: 0.12.2
                       (was: 0.13.0)

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions
> ---------------------------------------------------------------------------------
>
>                 Key: HUDI-4261
>                 URL: https://issues.apache.org/jira/browse/HUDI-4261
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: writer-core
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Critical
>             Fix For: 0.12.2
>
>         Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png
>
>
> While experimenting w/ bulk-inserting I've stumbled upon an OOM failure when
> you do bulk-insert w/ sort-mode "NONE" for a table w/ a large number of
> partitions (> 1000).
>
> This happens for the same reasons as HUDI-3883: every logical partition
> handled by Spark (let's say we have N of these, equal to the
> shuffling-parallelism in Hudi) will likely have a record from every physical
> partition on disk (let's say we have M of these), since no re-partitioning
> is done to align with the actual partition-column. B/c of that every logical
> partition will be writing into every physical one.
>
> This will eventually result in:
> # M * N files in the table
> # A "handle" kept in memory by Hudi for every file being written, which in
> turn will hold a full buffer's worth of Parquet data (until flushed).
>
> This ultimately leads to an OOM.
>
> !Screen Shot 2022-06-15 at 6.06.06 PM.png!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
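To make the M * N reasoning in the description concrete, here is a rough back-of-the-envelope sketch in Python. It is not Hudi code: the function name, the `concurrent_tasks_per_executor` parameter, and the per-handle buffer size are hypothetical values chosen for illustration, and the model assumes the worst case described above, where every Spark partition holds at least one record for every table partition.

```python
def estimate_none_sort_footprint(
    spark_partitions: int,               # N: logical partitions (shuffle parallelism)
    table_partitions: int,               # M: physical partitions on disk
    concurrent_tasks_per_executor: int,  # assumption: tasks running at once per executor
    buffer_bytes_per_handle: int,        # assumption: bytes buffered per open file handle
) -> dict:
    """Worst-case estimate when every Spark partition touches every table partition."""
    # Each of the N tasks writes one file into each of the M table partitions.
    total_files = spark_partitions * table_partitions
    # A single task keeps one open handle per table partition it writes to.
    open_handles_per_task = table_partitions
    # Handles from all concurrently running tasks share the executor's memory.
    buffered_bytes_per_executor = (
        open_handles_per_task
        * concurrent_tasks_per_executor
        * buffer_bytes_per_handle
    )
    return {
        "total_files": total_files,
        "open_handles_per_task": open_handles_per_task,
        "buffered_bytes_per_executor": buffered_bytes_per_executor,
    }


if __name__ == "__main__":
    mib = 1024 * 1024
    # Illustrative numbers only: N=200, M=1000, 4 concurrent tasks, 1 MiB per handle.
    result = estimate_none_sort_footprint(200, 1000, 4, 1 * mib)
    print(result["total_files"])                         # 200000
    print(result["buffered_bytes_per_executor"] // mib)  # 4000
```

Even with a modest assumed buffer of 1 MiB per handle, the buffered data per executor grows linearly with the number of table partitions, which is consistent with the OOM reported for tables with more than 1000 partitions.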
[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions
[ https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4261:
----------------------------------
    Component/s: writer-core
[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions
[ https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4261:
-----------------------------
    Sprint: 2022/08/22, 2022/09/05, 2022/10/18, 2022/12/12, 2023-01-09
                (was: 2022/08/22, 2022/09/05, 2022/10/18, 2022/12/12, 0.13.0 Final Sprint)
[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions
[ https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4261:
----------------------------------
    Sprint: 2022/08/22, 2022/09/05, 2022/10/18, 2022/12/12, 0.13.0 Final Sprint
                (was: 2022/08/22, 2022/09/05, 2022/10/18, 2022/12/12)
[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions
[ https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4261:
----------------------------------
    Priority: Critical
                  (was: Blocker)
[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions
[ https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4261:
----------------------------------
    Fix Version/s:
                       (was: 0.13.0)
[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions
[ https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4261:
-----------------------------
    Sprint: 2022/08/22, 2022/09/05, 2022/10/18, 2022/12/06
                (was: 2022/08/22, 2022/09/05, 2022/10/18)
[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions
[ https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4261:
-----------------------------
    Sprint: 2022/08/22, 2022/09/05, 2022/10/18
                (was: 2022/08/22, 2022/09/05, 2022/10/18, 2022/11/29)
[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions
[ https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4261:
----------------------------------
    Sprint: 2022/08/22, 2022/09/05, 2022/10/18, 2022/11/29
                (was: 2022/08/22, 2022/09/05, 2022/10/18)
[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions
[ https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4261:
----------------------------------
    Sprint: 2022/08/22, 2022/09/05, 2022/10/18
                (was: 2022/08/22, 2022/09/05, 2022/10/18, 2022/11/15)
[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions
[ https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4261:
-----------------------------
    Sprint: 2022/08/22, 2022/09/05, 2022/10/18, 2022/11/15
                (was: 2022/08/22, 2022/09/05, 2022/10/18, 2022/11/01)
[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions
[ https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4261:
-----------------------------
    Sprint: 2022/08/22, 2022/09/05, 2022/10/18, 2022/11/01
                (was: 2022/08/22, 2022/09/05, 2022/10/18)
[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions
[ https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4261:
----------------------------------
    Sprint: 2022/08/22, 2022/09/05, 2022/10/18
                (was: 2022/08/22, 2022/09/05)
[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions
[ https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4261:
----------------------------------
    Sprint: 2022/08/22, 2022/09/05
                (was: 2022/08/22, 2022/09/05, 2022/09/19)
[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions
[ https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4261:
-----------------------------
    Sprint: 2022/08/22, 2022/09/05, 2022/09/19
                (was: 2022/08/22, 2022/09/05)
[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions
[ https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4261:
-----------------------------
    Sprint: 2022/08/22, 2022/09/05
                (was: 2022/08/22)
[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions
[ https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4261:
-----------------------------
    Sprint: 2022/08/22
[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions
[ https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4261:
----------------------------------
    Summary: OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions
                 (was: OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitino)
[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions
[ https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4261:
----------------------------------
    Fix Version/s: 0.13.0
                       (was: 0.12.0)